Systematic Review on Learning-based Spectral CT

Spectral computed tomography (CT) has recently emerged as an advanced form of medical CT that significantly improves upon conventional (single-energy) CT. Spectral CT has two main forms, dual-energy computed tomography (DECT) and photon-counting computed tomography (PCCT), which offer image improvement, material decomposition, and feature quantification relative to conventional CT. However, the inherent challenges of spectral CT, manifested as data and image artifacts, remain a bottleneck for clinical applications. To address these problems, machine learning techniques have been widely applied to spectral CT. In this review, we present the state-of-the-art data-driven techniques for spectral CT.


I. INTRODUCTION
Since Cormack and Hounsfield's Nobel prize-winning breakthrough, X-ray CT has been extensively used in medical applications and produces a huge number of gray-scale CT images. However, these images are often insufficient to distinguish crucial differences between biological tissues and contrast agents. From the perspective of physics, the X-ray spectrum from a medical device is polychromatic, and interactions between X-rays and biological tissues depend on the X-ray energy, which suggests the feasibility of obtaining spectral, multi-energy, or true-color CT images.
Over the past decade, spectral CT has rapidly developed into a new generation of CT technology. DECT and PCCT are the two main forms of spectral CT. DECT is a method of acquiring two projection datasets at different energy levels. PCCT, on the other hand, uses detectors that measure individual photons and their energy, promising significantly better performance with major improvements in energy resolution, spatial resolution, and dose efficiency [1], [2]. Despite the intrinsic merits of spectral CT, there are technical challenges that have been, are being, or are yet to be addressed [3], [4]. To meet these challenges, the solutions can be hardware-oriented, software-oriented, or hybrid.
Traditionally, CT algorithms are grouped into two categories: analytic and iterative reconstruction. A new category of CT algorithms has recently emerged: artificial intelligence (AI)-inspired, learning-based, or data-driven reconstruction. These algorithms are commonly implemented as deep neural networks (NNs), which are iteratively trained for image reconstruction and post-processing, and then used for inference in a feed-forward fashion, just like a closed-form solution.
Several reviews have been dedicated to machine learning and deep learning in CT. These papers cover a wide range of topics, including image reconstruction, segmentation, classification, and more. For example, Litjens et al. [5] and Sahiner et al. [6] comprehensively surveyed deep learning applications in medical imaging. Domingues et al. [7] presented a review of deep learning in CT and positron emission tomography (PET). However, few reviews have specifically focused on spectral CT.
This review paper provides a technical overview of the current state-of-the-art machine learning techniques for spectral CT, especially deep learning ones. The paper is divided into the following sections: DECT and PCCT systems, image reconstruction, material decomposition, pre- and post-processing, hybrid imaging, perspectives and conclusion. Section II describes DECT and PCCT systems. Section III discusses learning-based techniques for multi-energy CT reconstruction from energy-binned data, which use shallow or deep network architectures, from dictionary learning (DL) to much deeper contemporary network models. Reconstruction of multi-energy CT images faces the problem of beam hardening. Section IV covers the different approaches to material decomposition: image-based techniques, which take multi-energy CT images as input, and alternative solutions to beam hardening, namely projection-based and one-step decompositions. Section V is dedicated to various pre-processing and post-processing aspects, based on sinogram data and spectral CT images respectively, including data calibration, image denoising and artifact correction, as well as image generation. Finally, Section VI covers key issues and future directions of learning-based spectral CT. The structure of this paper is outlined in Fig. 1.

Notations
Vectors (resp. matrices) are represented with bold lowercase (resp. uppercase) characters. Images are represented as $J$-dimensional real-valued vectors, which can be reshaped into 2-dimensional (2-D) or 3-dimensional (3-D) objects, where $J$ is the number of image pixels or voxels. $I$ is the number of rays per energy bin. '$\top$' is the matrix transposition symbol. A NN is represented by a bold calligraphic uppercase character with a subscript representing the weights to be trained, e.g., $\boldsymbol{\mathcal{F}}_\theta$. $\|\cdot\|_0$ is the $\ell_0$ semi-norm, defined for all $x = [x_1, \ldots, x_N]^\top \in \mathbb{R}^N$ as $\|x\|_0 = \#\{n \in \{1, \ldots, N\} : x_n \neq 0\}$, where $\#A$ denotes the cardinality of a set $A$, and $\|\cdot\|_p$, $p \geq 1$, is the $\ell_p$-norm. For a positive-definite matrix $M \in \mathbb{R}^{N \times N}$, $\|\cdot\|_M$ is the weighted $\ell_2$-norm defined for all $x \in \mathbb{R}^N$ as $\|x\|_M = \sqrt{x^\top M x}$, and $\|\cdot\|_F$ denotes the Frobenius norm.

II. DECT AND PCCT SYSTEMS
The first attempt to differentiate materials using CT with multiple X-ray energy spectra was made in the 1970s [8]. Since then, spectral CT technologies have been continuously evolving. Traditional DECT and spectrally-resolving PCCT are the two specific forms of spectral CT that are both commercially available. The former uses a minimum of two separate X-ray energy spectra to differentiate two basis materials with different attenuation properties at various energy levels, while the latter usually involves an advanced detector technology known as energy-resolving photon-counting detectors (PCDs), which resolve the spectral information of X-ray photons emitted from a polychromatic X-ray source into two or more energy bins. DECT overcomes several limitations of single-spectrum CT and has achieved clinical acceptance and widespread application. In the following, several types of DECT are briefly described. We will not cover all technologies, but focus on those that are currently representative. Interested readers may refer to [9]-[18] for more details and comparisons.
Sequential acquisition is perhaps the most straightforward DECT imaging approach. It performs two consecutive scans of the same anatomy using an X-ray source operated first at a low peak-kilovoltage (kVp) setting and then at a high-kVp setting. The approach requires no hardware modification, but may suffer from image misregistration due to motion in the delay between the low- and high-kVp scans. Advanced DECT technologies all utilize specific hardware to mitigate the misregistration problem and shorten the data acquisition time.
The dual-source DECT scanner was first introduced in 2005 [19]; it features two source-detector systems arranged orthogonally in the same gantry to acquire the low- and high-energy scans simultaneously. Although the 90-degree phase shift between the two scans creates a slight temporal offset, the two X-ray sources can select independent X-ray energy spectra to optimize the spectral separation for material differentiation in the data and/or image domains.
A dual-layer detector, i.e., a combination of two detector layers of scintillation material, is also a good solution for DECT [20]-[23]. In this approach, the low- and high-energy datasets are collected simultaneously by the two detector layers with perfect spatial alignment and excellent synchronicity. This advantage simplifies direct data-domain material decomposition.
Fast kVp-switching DECT is yet another technology, which uses a highly specialized X-ray generator that can rapidly switch the tube voltage between low- and high-kVp settings during data acquisition. The first commercially available fast kVp-switching DECT scanner (GE Discovery CT750 HD) is capable of changing the tube voltage for each projection angle, so that each low- and high-kVp projection pair can be obtained almost simultaneously. The material decomposition can then be performed in the data domain.

Fig. 1. Structure of this review paper, with the sections keyed to the main steps in the spectral CT imaging process: imaging system (Section II); data pre-processing (Section V-A); synergistic reconstruction of binned projection data (Section III); projection-based material decomposition (Section IV-B); image-based material decomposition (Section IV-A); one-step material decomposition (Section IV-C); image post-processing of reconstructed images (Sections V-B and V-C).
A similar design has been reported in [24], where the authors utilized a linear accelerator as the X-ray source to generate rapidly switching electron pulses of 6 MeV and 9 MeV. The result is an experimental MeV DECT system developed for cargo container inspection. Another type of fast kVp-switching DECT scanner has recently been introduced (Canon Aquilion ONE/PRISM) [25], which switches the tube voltage less frequently, allowing it to acquire the same energy over multiple successive projection angles. This design simplifies tube current modulation, making dose balancing at the two energy levels less complex. Along with the fast kVp-switching process, there is also a grating-based method that can help improve data acquisition [26]. In this method, an X-ray filter that combines absorption and filtering gratings is placed between the source and the patient. The gratings move relative to each other and are synchronized with the tube-switching process to avoid spectral correlation. Simulation studies have shown improved spectral information with reduced motion-induced artifacts.
PCD technology plays an important role in PCCT imaging. PCDs require a single layer of semiconductor sensor that converts X-ray photons directly into electrical signals. The main converter materials at present are cadmium zinc telluride (CZT) and silicon (Si). CZT has a higher atomic number Z than Si and a relatively high X-ray stopping power. Thus, a CZT-based PCD can have thin sensor layers of only a few millimeters, whereas Si-based detectors must be long enough in the beam direction to ensure good X-ray absorption. In one example of a Si-based detector, the Si wafers are mounted sideways, or edge-on, against the incoming X-rays to form a deep Si strip detector [27]. Therefore, building a full-area Si detector system can be more challenging. For imaging performance, both types of PCD have advantages and disadvantages in terms of signal quality as well as detection efficiency. More detailed comparisons can be found in [28], [29].
The innovation of PCDs makes PCCT more attractive and offers unique advantages over conventional CT or DECT. These include improved dose efficiency through the elimination of electronic noise, improved contrast-to-noise ratio (CNR) through energy weighting [29]-[31], higher spatial resolution due to the small sub-millimeter PCD pixels fabricated without any septa [29], [32], and, most importantly, unprecedented material decomposition capabilities, potentially for multi-tracer studies. Although PCCT is potentially more advantageous, it has to deal with technical challenges, including charge sharing and pile-up effects, together with the need for substantial hardware and system research and development. Currently, the accessibility of PCCT for clinical applications is still limited.

III. MULTI-ENERGY IMAGE RECONSTRUCTION
Spectral CT, i.e., DECT and PCCT, offers the possibility to perform separate measurements, each corresponding to an energy spectrum. One possibility is to reconstruct several attenuation CT images at different energies from these binned raw data. These images can then be used, e.g., for image-based material decomposition [33], [34], as illustrated in the top path of Fig. 1; more sophisticated methods, in particular the one-step reconstruction of material images, will be discussed in Section IV.
The acquired projections usually suffer from a low signal-to-noise ratio due to the limited number of photons in each energy bin [35]. Moreover, practical constraints such as a reduced scanning time restrict CT systems to a limited number of views. Therefore, the development of specific multi-energy reconstruction algorithms is of major importance. This section reviews existing algorithms for multi-energy CT reconstruction from energy-binned projection data, progressing from conventional CT reconstruction algorithms to synergistic multi-energy CT reconstruction incorporating DL techniques and deep learning architectures. The methods presented here are only a subset of the literature on multichannel image reconstruction, and we refer the readers to Arridge et al. [36] for an exhaustive review.

A. Forward and Inverse Problems
In this section, we briefly introduce a forward model that applies equally to PCCT and DECT. We consider a standard discrete model used in model-based iterative reconstruction (MBIR).
The linear attenuation image takes the form of a spatially- and energy-dependent function $\mu : \mathbb{R}^n \times \mathbb{R}_+ \to \mathbb{R}_+$, $n = 2, 3$, such that for all $r \in \mathbb{R}^n$ and all $E \in \mathbb{R}_+$, $\mu(r, E)$ is the linear attenuation at position $r$ and energy $E$. Standard CT systems perform measurements along a collection of rays $\{L_i\}$, where $L_i \subset \mathbb{R}^n$ denotes the $i$-th ray, $i = 1, \ldots, I$, with $I = N_d \times N_s$, $N_d$ and $N_s$ being respectively the number of detector pixels and the number of source positions. For all $i = 1, \ldots, I$, the expected signal (e.g., the number of photons in PCCT) is given by the Beer-Lambert law as

$$\bar{y}_i = \int h_i(E) \exp\left( -\int_{L_i} \mu(r, E)\, \mathrm{d}r \right) \mathrm{d}E + r_i, \quad (1)$$

where $\int_{L_i}$ denotes the line integral along $L_i$, $h_i$ is the corresponding X-ray photon flux, which accounts for the source spectrum and the detector sensitivity (times the energy with energy-integrating detectors), and $r_i$ is the background term (e.g., scatter, dark current).
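To make the model concrete, the following minimal sketch evaluates the expected signal of Eq. (1) numerically for a single ray; the spectrum shape, attenuation curve, and all numerical values are illustrative placeholders, not calibrated physics.

```python
import numpy as np

energies = np.arange(20, 121)                      # keV grid
h_i = np.exp(-0.5 * ((energies - 70) / 20) ** 2)   # toy photon flux h_i(E)
h_i *= 1e5 / h_i.sum()                             # ~1e5 photons along ray L_i

mu_E = 0.2 * (70.0 / energies) ** 1.5              # toy water-like mu(E) in 1/cm
path_length = 10.0                                 # cm traversed along L_i
r_i = 50.0                                         # background (scatter, dark current)

# Expected signal: energy integral of h_i(E) exp(-line integral) plus r_i
y_i = np.sum(h_i * np.exp(-mu_E * path_length)) + r_i
print(f"expected counts on ray i: {y_i:.1f}")
```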
In multi-energy CT (e.g., PCCT and DECT), the measurements are regrouped into $K$ energy bins ($K = 2$ for DECT and more for PCCT). For each bin $k$, the expected number of detected X-ray photons is

$$\bar{y}_{i,k} = \int h_{i,k}(E) \exp\left( -\int_{L_{i,k}} \mu(r, E)\, \mathrm{d}r \right) \mathrm{d}E + r_{i,k}, \quad (2)$$

where $L_{i,k}$ is the $i$-th ray for bin $k$, $h_{i,k}$ is the X-ray photon flux for bin $k$, and $r_{i,k}$ is the background term. In PCCT, each bin $k$ nominally corresponds to an energy interval, although detected photons may spill over into the neighboring intervals owing to the imperfect detector energy response. We assume that the number of measurements is equal to $I$ for each energy bin $k$.
The forward model (2) applies to both PCCT and DECT. In PCCT, the detector records the deposited energy for each interaction, and the energy binning is performed in the same way for each ray, so that $L_{i,k}$ is independent of the bin $k$. In contrast, DECT systems (except those with dual-layer detectors) perform two independent acquisitions with two different X-ray photon fluxes $h_{i,1}$ and $h_{i,2}$, possibly at different source locations (e.g., via rapid kVp switching), so that the rays generally depend on $k$.
One of the possible tasks in PCCT and DECT is to estimate a collection of $K$ attenuation CT images, i.e., one image per each of the $K$ binned measurements $\{\boldsymbol{y}_k\}$. The energy-dependent image to reconstruct is sampled on a grid of $J$ pixels, assuming that $\mu$ can be decomposed on a basis of $J$ "pixel functions" $u_j$ such that

$$\mu(r, E) = \sum_{j=1}^{J} \mu_j(E)\, u_j(r), \quad (3)$$

where $\mu_j(E)$ is the energy-dependent attenuation at pixel $j$.
The line integrals in Eq. (1) and Eq. (2) can therefore be rewritten as

$$\int_{L_{i,k}} \mu(r, E)\, \mathrm{d}r \approx \sum_{j=1}^{J} a_{i,j,k}\, \mu_j(E) = \left[ \boldsymbol{A}_k \boldsymbol{\mu}(E) \right]_i, \quad (4)$$

with $\boldsymbol{A}_k = \{a_{i,j,k}\} \in \mathbb{R}^{I \times J}$ the system matrix for bin $k$, $a_{i,j,k} = \int_{L_{i,k}} u_j(r)\, \mathrm{d}r$, and $\boldsymbol{\mu}(E) = [\mu_1(E), \ldots, \mu_J(E)]^\top \in \mathbb{R}_+^J$ the discretized energy-dependent attenuation. We consider the following model, which is an approximate version of Eq. (2),

$$\bar{y}_{i,k}(\boldsymbol{\mu}_k) = \bar{h}_{i,k} \exp\left( -\left[ \boldsymbol{A}_k \boldsymbol{\mu}_k \right]_i \right) + r_{i,k}, \quad (5)$$

where $\bar{h}_{i,k} = \int h_{i,k}(E)\, \mathrm{d}E$ is the total flux in bin $k$ and $\boldsymbol{\mu}_k \in \mathbb{R}_+^J$ is the attenuation image at an effective energy of bin $k$. The reconstruction of each $\boldsymbol{\mu}_k$ is achieved by "fitting" the expectation $\bar{\boldsymbol{y}}_k(\boldsymbol{\mu}_k) = [\bar{y}_{1,k}(\boldsymbol{\mu}_k), \ldots, \bar{y}_{I,k}(\boldsymbol{\mu}_k)]^\top$ to the measurement $\boldsymbol{y}_k$, for example by solving the inverse problem

$$\boldsymbol{A}_k \boldsymbol{\mu}_k = \hat{\boldsymbol{b}}_k \quad (6)$$

with respect to $\boldsymbol{\mu}_k$, where $\hat{\boldsymbol{b}}_k = [\hat{b}_{1,k}, \ldots, \hat{b}_{I,k}]^\top$, $\hat{b}_{i,k} = -\log\left( (y_{i,k} - r_{i,k}) / \bar{h}_{i,k} \right)$, is the vector of the approximated line integrals. This can be achieved using an analytical method such as filtered backprojection (FBP) [37], or using an iterative technique [38], [39]. Unfortunately, the inverse problem (6) is ill-posed, and direct inversion leads to noise amplification, which is impractical for low-dose imaging. Moreover, the inversion relies on an idealized mathematical model that does not reflect the physics of the acquisition, in particular by ignoring the polychromatic nature of the X-ray spectra.
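As a minimal sketch of how the approximated line integrals of Eq. (6) are formed from binned counts; the clipping constant is a hypothetical numerical safeguard, not part of the model:

```python
import numpy as np

def line_integral_estimates(y_k, h_k, r_k):
    """Approximate line integrals b_hat_{i,k} = -log((y_{i,k} - r_{i,k}) / h_{i,k}).

    y_k, h_k, r_k: arrays of measured counts, total fluxes, and background terms
    for every ray i of energy bin k. Clipping guards against non-positive
    arguments caused by noise (an assumed epsilon, chosen for illustration).
    """
    ratio = np.clip((y_k - r_k) / h_k, 1e-8, None)
    return -np.log(ratio)

# With b_hat in hand, mu_k can be estimated by FBP or by iteratively
# solving A_k mu_k = b_hat in the least-squares sense.
```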

B. Penalized Reconstruction
Alternatively, the reconstruction can be achieved for each energy bin $k$ by finding an estimate $\hat{\boldsymbol{\mu}}_k$ as the solution of an optimization problem of the form

$$\hat{\boldsymbol{\mu}}_k \in \arg\min_{\boldsymbol{\mu}_k} \; L\left( \boldsymbol{y}_k, \bar{\boldsymbol{y}}_k(\boldsymbol{\mu}_k) \right) + \beta_k R_k(\boldsymbol{\mu}_k), \quad (7)$$

where $L$ is a loss function (e.g., the Poisson negative log-likelihood for PCCT) that evaluates the goodness of fit between the data $\boldsymbol{y}_k$ and $\bar{\boldsymbol{y}}_k(\boldsymbol{\mu}_k)$, $\beta_k > 0$ is a weight, and $R_k$ is a penalty function or regularizer, generally convex and non-negative, that promotes desired image properties while controlling the noise. The data fidelity term in (7) is convex when $r_{i,k} = 0$ for all $i, k$. Although many approaches have been proposed to solve (7), most algorithms resemble the proximal gradient algorithm [40], [41]: given an image estimate $\boldsymbol{\mu}_k^{(q)}$ at iteration $q$, the next estimate $\boldsymbol{\mu}_k^{(q+1)}$ is obtained via a reconstruction step followed by a smoothing step,

$$\boldsymbol{\mu}_k^{(q+1/2)} = \boldsymbol{\mu}_k^{(q)} - \boldsymbol{H}_k^{-1} g_k\left( \boldsymbol{\mu}_k^{(q)} \right), \quad (8)$$
$$\boldsymbol{\mu}_k^{(q+1)} \in \arg\min_{\boldsymbol{\mu}_k} \; \frac{1}{2} \left\| \boldsymbol{\mu}_k - \boldsymbol{\mu}_k^{(q+1/2)} \right\|_{\boldsymbol{H}_k}^2 + \beta_k R_k(\boldsymbol{\mu}_k), \quad (9)$$

where $g_k$ is the gradient of the data fidelity loss for bin $k$ and $\boldsymbol{H}_k$ is a suitable diagonal positive-definite matrix (typically, a diagonal majorizer of the Hessian of the data fidelity loss). The first step (8) is a gradient descent that guarantees a decrease of the data fidelity, while the second step (9) is an image denoising operation. This type of approach encompasses optimization transfer techniques such as separable quadratic surrogates (SQS) [42], [43].
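A minimal sketch of the two-step scheme (8)-(9), using the linearized model $\boldsymbol{A}_k \boldsymbol{\mu}_k \approx \hat{\boldsymbol{b}}_k$ and a Gaussian data fit for brevity; `prox_R` stands in for the denoising step (9) and is an assumed callable, not a specific published regularizer:

```python
import numpy as np

def prox_gradient_recon(b_hat, A_k, prox_R, n_iter=50):
    """Sketch of Eqs. (8)-(9) for one energy bin under A_k mu ~= b_hat.

    prox_R(v, w) should solve the denoising subproblem (9) around v with
    per-pixel weights w; its interface is an assumption of this sketch.
    """
    J = A_k.shape[1]
    mu = np.zeros(J)
    # SQS-style diagonal majorizer of A^T A: H_j = sum_i |a_ij| * sum_j' |a_ij'|
    H = np.abs(A_k).T @ np.abs(A_k).sum(axis=1) + 1e-12
    for _ in range(n_iter):
        grad = A_k.T @ (A_k @ mu - b_hat)   # gradient of 0.5 ||A mu - b||^2
        mu_half = mu - grad / H             # reconstruction step (8)
        mu = prox_R(mu_half, 1.0 / H)       # smoothing step (9)
    return mu
```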
The choice of $R_k$ depends on the desired image properties. A popular choice consists in penalizing differences between the values of neighboring pixels with a smooth edge-preserving potential function, in which case solving Eq. (9) is achieved with standard smooth optimization tools [42], [43]. Another popular choice is the compressed sensing (CS) approach, which has been widely used in medical imaging with undersampled measurement operators $\boldsymbol{A}_k$ (e.g., sparse-view CT). CS assumes that the signal to be recovered is sparse in some sense, allowing it to be recovered from far fewer samples than required by the Nyquist-Shannon sampling theorem. In the following paragraphs, we briefly discuss the synthesis and the analysis approaches.
In the synthesis approach, it is assumed that $\boldsymbol{\mu}_k = \boldsymbol{D}_k \boldsymbol{z}_k$, where $\boldsymbol{D}_k \in \mathbb{R}^{J \times S}$ is a dictionary matrix, i.e., an overcomplete basis consisting of $S$ atoms, and $\boldsymbol{z}_k \in \mathbb{R}^S$ is a sparse vector of coefficients such that $\boldsymbol{\mu}_k$ is represented by a fraction of the columns of $\boldsymbol{D}_k$, or atoms. The reconstruction of the image is then given by

$$\hat{\boldsymbol{z}}_k \in \arg\min_{\boldsymbol{z}_k} \; L\left( \boldsymbol{y}_k, \bar{\boldsymbol{y}}_k(\boldsymbol{D}_k \boldsymbol{z}_k) \right) + \alpha \|\boldsymbol{z}_k\|_m, \quad \hat{\boldsymbol{\mu}}_k = \boldsymbol{D}_k \hat{\boldsymbol{z}}_k, \quad (10)$$

where $\|\cdot\|_m$ can be either the $\ell_0$ semi-norm or its convex relaxation, the $\ell_1$-norm, and $\alpha > 0$ is a weight controlling the sparsity of $\boldsymbol{z}_k$. The optimization can be achieved by orthogonal matching pursuit [44] for $m = 0$ and proximal gradient for $m = 1$. In some situations, imposing $\boldsymbol{\mu}_k = \boldsymbol{D}_k \boldsymbol{z}_k$ may be too restrictive, and a relaxed constraint $\boldsymbol{\mu}_k \approx \boldsymbol{D}_k \boldsymbol{z}_k$ is often preferred. The reconstruction is then achieved by penalized reconstruction using a regularizer $R_{\boldsymbol{D}_k}$ that prevents $\boldsymbol{\mu}_k$ from deviating from $\boldsymbol{D}_k \boldsymbol{z}_k$, usually defined as

$$R_{\boldsymbol{D}_k}(\boldsymbol{\mu}_k) = \min_{\boldsymbol{z}_k} \; \frac{1}{2} \left\| \boldsymbol{\mu}_k - \boldsymbol{D}_k \boldsymbol{z}_k \right\|_2^2 + \alpha_k \|\boldsymbol{z}_k\|_m, \quad (11)$$

where $\alpha_k > 0$ is a weight. Solving Eq. (7) is then achieved by alternating between minimization in $\boldsymbol{\mu}_k$ (e.g., by performing several iterations of (8) and (9)) and minimization in $\boldsymbol{z}_k$ (orthogonal matching pursuit [44] for $m = 0$ and proximal gradient for $m = 1$). This type of penalty forms the basis of the learned penalties that we will address in Section III-D.
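For illustration, a minimal ISTA sketch for the $m = 1$ case of the relaxed synthesis model: it finds a sparse code $z$ with $\boldsymbol{\mu}_k \approx \boldsymbol{D}_k \boldsymbol{z}$ by minimizing $\tfrac{1}{2}\|\boldsymbol{\mu}_k - \boldsymbol{D}_k \boldsymbol{z}\|_2^2 + \alpha \|\boldsymbol{z}\|_1$; names and the step-size choice are standard conventions, not taken from a specific reference above:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code_ista(mu_k, D_k, alpha, n_iter=100):
    """ISTA for the inner problem of Eq. (11) with m = 1."""
    step = 1.0 / np.linalg.norm(D_k, 2) ** 2   # 1/L with L = ||D_k||_2^2
    z = np.zeros(D_k.shape[1])
    for _ in range(n_iter):
        grad = D_k.T @ (D_k @ z - mu_k)        # gradient of the quadratic fit
        z = soft_threshold(z - step * grad, step * alpha)
    return z
```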
In the analysis (encoding) approach, it is assumed that $\boldsymbol{T}_k \boldsymbol{\mu}_k$ is sparse, where $\boldsymbol{T}_k \in \mathbb{R}^{D \times J}$ is a sparsifying transform, and the penalty $R_k$ is

$$R_k(\boldsymbol{\mu}_k) = \left\| \boldsymbol{T}_k \boldsymbol{\mu}_k \right\|_m. \quad (12)$$

For example, in image processing, $\boldsymbol{T}_k$ can be a wavelet transform or finite differences (discrete gradient). In the latter case, and when $m = 1$, the corresponding penalty $R_k$ is referred to as total variation (TV). TV has been extensively used in image processing for its ability to represent piecewise constant objects [51]. Because $R_k$ is non-smooth, solving Eq. (9) requires variable splitting techniques such as proximal gradient, the alternating direction method of multipliers (ADMM) [52], or the Chambolle-Pock algorithm [53].

C. Synergistic Penalties
Alternatively, the images can be reconstructed simultaneously. Introducing $\boldsymbol{\mu} = \{\boldsymbol{\mu}_k\}$ the spectral CT multichannel image, $\boldsymbol{y} = \{\boldsymbol{y}_k\}$ the binned projection data, and $\bar{\boldsymbol{y}}(\boldsymbol{\mu}) = \{\bar{\boldsymbol{y}}_k(\boldsymbol{\mu}_k)\}$ the expected binned projections, the images can be jointly reconstructed as

$$\hat{\boldsymbol{\mu}} \in \arg\min_{\boldsymbol{\mu}} \; L\left( \boldsymbol{y}, \bar{\boldsymbol{y}}(\boldsymbol{\mu}) \right) + \beta R(\boldsymbol{\mu}), \quad (13)$$

where $R$ is a synergistic penalty function that promotes structural and/or functional dependencies between the multiple images. A proximal gradient algorithm to solve Eq. (13) updates $\boldsymbol{\mu}^{(q)} = \{\boldsymbol{\mu}_k^{(q)}\}$ at iteration $q + 1$ as

$$\boldsymbol{\mu}_k^{(q+1/2)} = \boldsymbol{\mu}_k^{(q)} - \boldsymbol{H}_k^{-1} g_k\left( \boldsymbol{\mu}_k^{(q)} \right), \quad k = 1, \ldots, K, \quad (14)$$
$$\boldsymbol{\mu}^{(q+1)} \in \arg\min_{\boldsymbol{\mu}} \; \sum_{k=1}^{K} \frac{1}{2} \left\| \boldsymbol{\mu}_k - \boldsymbol{\mu}_k^{(q+1/2)} \right\|_{\boldsymbol{H}_k}^2 + \beta R(\boldsymbol{\mu}), \quad (15)$$

where Eq. (15) corresponds to a synergistic smoothing step. The paradigm shift here is that allowing the channels to "talk to each other" can reduce the noise, as each channel participates in the reconstruction of all the other ones. In the context of spectral CT, this means that the reconstruction of each image $\boldsymbol{\mu}_k$ benefits from the entire measurement data $\boldsymbol{y}$. Here, we present a non-exhaustive list of existing approaches. One class of approaches consists of enforcing structural similarities between the $K$ channels. Examples include joint total variation (JTV), which encourages gradient-sparse solutions (in the same way as conventional TV) while also encouraging joint sparsity of the gradients [54], [55]. Total nuclear variation encourages common edge locations and a shared gradient direction among image channels [56], [57]. All these works reported improved image quality with synergistic image processing as compared with single-image processing.
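As an illustration of how a synergistic penalty couples the channels, the following sketch evaluates a JTV-style penalty on a stack of K 2-D channel images; the forward-difference scheme and the stabilizing epsilon are implementation choices, not prescribed by the cited works:

```python
import numpy as np

def joint_tv(images):
    """JTV-style penalty for a (K, H, W) stack: gradient magnitudes are
    pooled across channels before the ell_1 sum, which couples the edges
    of the K channel images."""
    dx = np.diff(images, axis=2, append=images[:, :, -1:])  # horizontal diffs
    dy = np.diff(images, axis=1, append=images[:, -1:, :])  # vertical diffs
    # Pixelwise ell_2 over both gradient components and all K channels
    mag = np.sqrt((dx ** 2 + dy ** 2).sum(axis=0) + 1e-12)
    return mag.sum()
```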
A second class of approaches consists of promoting similarities across channels by controlling the rank of the multichannel image. Given that the energy dependence of human tissues can be represented by the linear combination of only two materials (see Section IV), it is natural to expect a low rank in some sense in the spectral dimension. For dynamic CT imaging, Gao et al. [58] proposed a method, namely robust principal component analysis based 4-D CT (RPCA-4DCT), based on a low-rank (LR) + sparse decomposition of the multichannel image matrix $\boldsymbol{M} = [\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K] \in \mathbb{R}^{J \times K}$ ($K$ time frames),

$$\boldsymbol{M} = \boldsymbol{M}_l + \boldsymbol{M}_s, \quad (16)$$

where $\boldsymbol{M}_l$ is an LR matrix representing the information that is repeated across the channels and $\boldsymbol{M}_s$ is a sparse matrix representing the variations in the form of outliers, together with a synergistic penalty defined as

$$R(\boldsymbol{M}) = \left\| \boldsymbol{M}_l \right\|_* + \gamma \left\| \boldsymbol{M}_s \right\|_1, \quad (17)$$

where $\gamma > 0$ and the nuclear norm $\|\cdot\|_*$ is a relaxation of the rank of a matrix; they showed that their approach outperforms TV-based regularization (in both the spatial and temporal dimensions). Gao et al. [59] then generalized this method for spectral CT with the prior rank, intensity and sparsity model, which uses the rank of a tight-frame transform of the LR matrix to better characterize the multi-level and multi-filtered image coherence across the energy spectrum, in combination with energy-dependent intensity information, and showed that their method outperforms the conventional LR + sparse decomposition. This principle was further generalized by "folding" the multichannel image matrix $\boldsymbol{M} \in \mathbb{R}^{J \times K}$ into a spatio-spectral tensor (for 2-D imaging) and applying a generalized tensor nuclear norm regularizer to exploit structural redundancies across the spatial dimensions (in addition to the spectral dimension) [60]-[65].
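A minimal sketch of an LR + sparse split in the spirit of Eqs. (16)-(17), alternating singular-value thresholding (for the nuclear norm) with soft thresholding (for the $\ell_1$ term); the alternating scheme and threshold values are illustrative simplifications, not the algorithms of [58], [59]:

```python
import numpy as np

def lowrank_plus_sparse(M, gamma=0.1, tau=1.0, n_iter=50):
    """Heuristic alternating split M ~= M_l + M_s.

    M: (J, K) multichannel image matrix. tau and gamma are untuned
    illustrative thresholds for the low-rank and sparse parts.
    """
    M_l = np.zeros_like(M)
    M_s = np.zeros_like(M)
    for _ in range(n_iter):
        # Singular value thresholding on the residual -> low-rank part
        U, s, Vt = np.linalg.svd(M - M_s, full_matrices=False)
        M_l = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
        # Soft thresholding on the remaining residual -> sparse part
        resid = M - M_l
        M_s = np.sign(resid) * np.maximum(np.abs(resid) - gamma, 0.0)
    return M_l, M_s
```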
A third and different class of approaches consists of enforcing structural similarities of each $\boldsymbol{\mu}_k$ with a reference low-noise, high-resolution image $\bar{\boldsymbol{\mu}}$, generally taken as the reconstruction from all energy bins combined. Instead of using a joint penalty $R$, each channel is controlled by a penalty $R_k$ of the form

$$R_k(\boldsymbol{\mu}_k) = S\left( \boldsymbol{\mu}_k, \bar{\boldsymbol{\mu}} \right), \quad (18)$$

where $S$ is a "similarity measure" between $\boldsymbol{\mu}_k$ and the reference image $\bar{\boldsymbol{\mu}}$. The prior image-constrained compressed sensing (PICCS) [66], [67] approach uses $S(\boldsymbol{\mu}_k, \bar{\boldsymbol{\mu}}) = \|\nabla(\boldsymbol{\mu}_k - \bar{\boldsymbol{\mu}})\|_m$, $\nabla$ denoting the discrete gradient; the $\ell_1$-norm can also be replaced with the $\ell_0$ semi-norm [68]. Variants of this approach include nonlocal similarity measures [69], [70] to preserve both high- and low-frequency components. More recently, Cueva et al. [71] proposed the directional TV approach for spectral CT, which enforces collinearity between the gradients of $\boldsymbol{\mu}_k$ and $\bar{\boldsymbol{\mu}}$ while preserving sparsity, and showed that their approach outperforms TV. To conclude, spectral CT reconstruction with synergistic penalties has been widely used to improve the quality of the reconstructed images. However, the success of this approach heavily depends on the selection of an appropriate synergistic penalty term, which is typically fixed and may not always accurately reflect the true underlying structure of the data.

D. Learned Penalties
Traditional regularization methods, such as those described in Sections III-B and III-C, impose a fixed handcrafted penalty on the reconstructed image based on certain assumptions about its structure, such as sparsity or smoothness. However, these assumptions may not always hold in practice, leading to suboptimal reconstructions. Learned penalty functions, on the other hand, can adaptively adjust the penalty term based on the specific characteristics of the data, allowing for more accurate and flexible reconstruction.
This subsection discusses learned synergistic penalties for multichannel image reconstruction. In particular, we will focus on penalties based on a generator $\boldsymbol{G}$, which is a trained mapping that takes as input a latent variable $\boldsymbol{z}$, which can be an image or a code, and returns a plausible multichannel image $\boldsymbol{G}(\boldsymbol{z}) = \{\boldsymbol{\mathcal{F}}_k(\boldsymbol{z})\}$. The latent variable $\boldsymbol{z}$ represents the patient and connects the different channels. The penalty function plays the role of a discriminator, promoting images originating from the generative model and penalizing images that deviate from it, in a similar fashion to the relaxed synthesis model (11).
Most of this subsection will address DL, i.e., $\boldsymbol{\mathcal{F}}_k(\boldsymbol{z}) = \boldsymbol{D}_k \boldsymbol{z}$ for some dictionary matrix $\boldsymbol{D}_k$, as it is the most prevalent learned penalty used in synergistic multichannel image reconstruction. Convolutional dictionary learning (CDL) will also be discussed in a short paragraph. Finally, we will discuss recent work that uses deep NN models.
In this subsection, $\boldsymbol{\mu}^{\mathrm{tr}} = \{\boldsymbol{\mu}_k^{\mathrm{tr}}\}$ denotes a random spectral CT image whose joint distribution corresponds to the empirical distribution derived from a training dataset of $L$ spectral CT images $\boldsymbol{\mu}^{\mathrm{tr},[1]}, \ldots, \boldsymbol{\mu}^{\mathrm{tr},[L]} \in \mathbb{R}^{JK}$, that is to say, for all mappings $h : \mathbb{R}^{JK} \to \mathbb{R}$,

$$\mathbb{E}\left[ h\left( \boldsymbol{\mu}^{\mathrm{tr}} \right) \right] = \frac{1}{L} \sum_{\ell=1}^{L} h\left( \boldsymbol{\mu}^{\mathrm{tr},[\ell]} \right). \quad (19)$$

1) Dictionary Learning: For simplicity, this section considers 2-D imaging (i.e., $n = 2$), so that each image $\boldsymbol{\mu}_k \in \mathbb{R}^J$ can be reshaped into a $\sqrt{J} \times \sqrt{J}$ square matrix. DL is a popular technique for regularizing the reconstruction process in medical imaging, and especially in CT reconstruction [72]-[75]. The basic idea behind DL is to learn a dictionary matrix that can represent the image with a fraction of its columns. The dictionary operator requires a large number of atoms to accurately represent all possible images, which increases the computational complexity of training. Therefore, to reduce the complexity, the image is generally split into $P$ smaller $d$-dimensional "patches" (possibly overlapping), with $d \ll J$. For a given energy bin $k$, the trained penalty used to reconstruct a single attenuation image $\boldsymbol{\mu}_k$ by penalized reconstruction (7) is given by

$$R_{\boldsymbol{D}_k^\star}(\boldsymbol{\mu}_k) = \min_{\{\boldsymbol{z}_p\}} \; \sum_{p=1}^{P} \frac{1}{2} \left\| \boldsymbol{P}_p \boldsymbol{\mu}_k - \boldsymbol{D}_k^\star \boldsymbol{z}_p \right\|_2^2 + \alpha_k \|\boldsymbol{z}_p\|_0, \quad (20)$$

where $\boldsymbol{D}_k^\star \in \mathbb{R}^{d \times S}$ is the trained dictionary matrix, $\boldsymbol{P}_p \in \mathbb{R}^{d \times J}$ is the $p$-th patch extractor, and each $\boldsymbol{z}_p$ is the sparse vector of coefficients representing the $p$-th patch with $\boldsymbol{D}_k^\star$. The training is generally performed by minimizing $R_{\boldsymbol{D}_k}$ with respect to $\boldsymbol{D}_k$ (with unit $\ell_2$-norm constraints on its columns) over a training dataset of high-quality images, for example using the K-SVD algorithm introduced by Aharon et al. [76].

DL can also be used to represent images synergistically. Tensor dictionary learning (TDL) consists in folding the spectral images into a spatio-spectral tensor $\mathcal{M} \in \mathbb{R}^{\sqrt{J} \times \sqrt{J} \times K}$ and in training a spatio-spectral tensor dictionary to sparsely represent $\mathcal{M}$ with a sparse core tensor $\mathcal{Z} \in \mathbb{R}^{s_1 \times s_2 \times s_3}$, such that each atom conveys information across the spectral dimension. A common approach to sparsely represent the tensor image $\mathcal{M}$ is the Tucker decomposition [77], [78]. It was utilized in multispectral image denoising [79], [80] as well as in dynamic CT [81] (by replacing the spectral dimension with the temporal dimension). Denoting $\mathcal{P}_p : \mathbb{R}^{\sqrt{J} \times \sqrt{J} \times K} \to \mathbb{R}^{\sqrt{d} \times \sqrt{d} \times K}$ the $p$-th spatio-spectral image patch extractor, each patch $\mathcal{P}_p(\mathcal{M})$ can be approximated by the Tucker decomposition as

$$\mathcal{P}_p(\mathcal{M}) \approx \mathcal{Z}_p \times_1 \boldsymbol{D}^{(1)} \times_2 \boldsymbol{D}^{(2)} \times_3 \boldsymbol{D}^{(3)}, \quad (21)$$

where $\mathcal{Z}_p \in \mathbb{R}^{s_1 \times s_2 \times s_3}$ is the core tensor for the $p$-th patch, $\boldsymbol{D}^{(1)} \in \mathbb{R}^{\sqrt{d} \times s_1}$ and $\boldsymbol{D}^{(2)} \in \mathbb{R}^{\sqrt{d} \times s_2}$ are the 2-D spatial dictionaries along each dimension, $\boldsymbol{D}^{(3)} \in \mathbb{R}^{K \times s_3}$ is the spectral dictionary (all of them consisting of orthogonal unit column vectors), and $\times_n$ is the mode-$n$ tensor/matrix product (see, for example, Semerci et al. [61] for a definition of the tensor-matrix product).
The Tucker decomposition requires a large number of atoms and therefore is cumbersome for DL in high dimensions. To remedy this, Zhang et al. [82] proposed to use the canonical polyadic decomposition (CPD), which assumes that the core tensor $\mathcal{Z}$ is diagonal, i.e., $s_1 = s_2 = s_3 = S$ and $(\mathcal{Z})_{a,b,c} \neq 0 \Rightarrow a = b = c$, which leads to the following approximation [78]:

$$\mathcal{P}_p(\mathcal{M}) \approx \sum_{s=1}^{S} (\boldsymbol{z}_p)_s\, \mathcal{D}_s, \quad (22)$$

where for all $s$, $\mathcal{D}_s = \boldsymbol{d}_s^{(1)} \otimes \boldsymbol{d}_s^{(2)} \otimes \boldsymbol{d}_s^{(3)}$ with $\boldsymbol{d}_s^{(n)}$ the $s$-th column of $\boldsymbol{D}^{(n)}$, $\boldsymbol{z}_p \in \mathbb{R}^S$ is a sparse vector corresponding to the diagonal of $\mathcal{Z}_p$, and '$\otimes$' denotes the matrix outer product. Zhang et al. then used this decomposition to train spatio-spectral dictionaries with a K-CPD algorithm [83], from which the following penalty term is derived:

$$R_{\{\mathcal{D}_s^\star\}}(\mathcal{M}) = \min_{\{\boldsymbol{z}_p\}} \; \sum_{p=1}^{P} \frac{1}{2} \left\| \mathcal{P}_p(\mathcal{M}) - \mathcal{S}_{\{\mathcal{D}_s^\star\}}(\boldsymbol{z}_p) \right\|_F^2 + \alpha \|\boldsymbol{z}_p\|_0, \quad (23)$$

with

$$\mathcal{S}_{\{\mathcal{D}_s\}}(\boldsymbol{z}) = \sum_{s=1}^{S} (\boldsymbol{z})_s\, \mathcal{D}_s. \quad (24)$$

The training is performed as

$$\{\mathcal{D}_s^\star\} \in \arg\min_{\{\mathcal{D}_s\}} \; \mathbb{E}\left[ R_{\{\mathcal{D}_s\}}\left( \mathcal{M}^{\mathrm{tr}} \right) \right], \quad (25)$$

where $\mathcal{M}^{\mathrm{tr}}$ is the spatio-spectral tensor obtained by folding a training multichannel image matrix $[\boldsymbol{\mu}_1^{\mathrm{tr}}, \ldots, \boldsymbol{\mu}_K^{\mathrm{tr}}]$, and the minimization is performed subject to unit-norm constraints on the atoms $\mathcal{D}_s$. Wu et al. [84] proposed a similar approach with the addition of the $\ell_0$ semi-norm of the gradient images at each energy bin in order to enforce piecewise smoothness of the images, while Li et al. [85] added a PICCS-like penalty (18) to enforce joint sparsity of the gradients.
We can observe that, with each spatio-spectral atom reshaped into a vector, the TDL regularizer with CPD can be rewritten as

$$R(\boldsymbol{\mu}) = \min_{\{\boldsymbol{z}_p\}} \; \sum_{p=1}^{P} \sum_{k=1}^{K} \frac{1}{2} \left\| \boldsymbol{P}_p \boldsymbol{\mu}_k - \boldsymbol{D}_k^\star \boldsymbol{z}_p \right\|_2^2 + \alpha \|\boldsymbol{z}_p\|_0. \quad (26)$$

This regularizer is a generalization of (20) to multichannel imaging with a collection of dictionaries $\{\boldsymbol{D}_k^\star\}$ and a unique sparse code $\{\boldsymbol{z}_p\}$ shared by all energy bins $k$. Similar representations were used in coupled DL for synergistic multimodal reconstruction, such as in PET/magnetic resonance imaging (MRI) [86], [87], multi-contrast MRI [88], as well as super-resolution [89].
Patch-based DL may be inefficient as the atoms are shift-variant and may produce atoms that are shifted versions of each other. Moreover, using many neighboring/overlapping patches across the training images is not efficient in terms of sparse representation, as sparsification is performed on each patch separately. Instead, CDL [90]-[92] consists in utilizing a trained dictionary of image filters to represent the image as a linear combination of sparse feature images convolved with the filters (synthesis model), which can be used in a penalty function similar to Eq. (20) without patch extraction. Bao et al. [93] used this approach for CT MBIR. Alternatively, convolutional analysis operator learning (CAOL) consists in training sparsifying convolutions, which can then be used as a penalty function for MBIR [94]. There are a few applications of CDL and CAOL in multichannel imaging and multi-energy CT (see [95] for a review). Degraux et al. [96] proposed a multichannel CDL model to represent two images simultaneously (intensity-depth imaging), using a collection of pairs of image filters. Gao et al. [97] proposed a more general model with common and unique filters. More recently, Perelli et al. [98] proposed a multichannel CAOL for DECT joint reconstruction, which uses pairs of image filters to jointly sparsify the low- and high-energy images, and demonstrated that their method outperforms JTV-based synergistic reconstruction.
2) Deep-Learned Penalties: The synthesis model used in DL can be generalized by replacing the multichannel dictionaries with deep NNs $\{\boldsymbol{\mathcal{F}}_{\theta_k}\}$. Unlike dictionary learning, which uses a finite number of atoms to represent the data, deep NNs can learn parameters that capture more intricate patterns and structures in the image data. A synergistic regularizer used in Eq. (7) can then be defined as

$$R_{\theta^\star}(\boldsymbol{\mu}) = \min_{\boldsymbol{z}} \; \sum_{k=1}^{K} \frac{1}{2} \left\| \boldsymbol{\mu}_k - \boldsymbol{\mathcal{F}}_{\theta_k^\star}(\boldsymbol{z}) \right\|_2^2 + H(\boldsymbol{z}), \quad (27)$$

where $H$ is a penalty function for $\boldsymbol{z}$ (not necessarily sparsity-promoting); this is the generalization of multichannel DL (26) using multiple NNs. Wang et al. [99] used this approach with a collection of U-Nets $\boldsymbol{\mathcal{F}}_{\theta_k}$ trained in a supervised way to map the attenuation image at the lowest energy bin $\boldsymbol{\mu}_1$ to the attenuation image at energy bin $k$, i.e.,

$$\theta_k^\star \in \arg\min_{\theta_k} \; \mathbb{E}\left[ \left\| \boldsymbol{\mathcal{F}}_{\theta_k}\left( \boldsymbol{\mu}_1^{\mathrm{tr}} \right) - \boldsymbol{\mu}_k^{\mathrm{tr}} \right\|_2^2 \right], \quad (28)$$

combined with a standard Huber penalty (the $H$ function in Eq. (27)) for $\boldsymbol{z}$. The trained penalty $R_{\theta^\star}$ "connects" the channels by a spectral image $\{\boldsymbol{\mu}_k\}$ such that each $\boldsymbol{\mu}_k$ originates from a single image $\boldsymbol{z}$ that is smooth in the sense of $H$. Wang et al. reported substantial noise reduction as compared with individually reconstructed images and JTV synergistic reconstruction.
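A toy PyTorch sketch of evaluating a penalty of the form (27): K small decoder branches F_k share a single latent image z, and the penalty value is obtained by minimizing over z the fit to the branches plus a simple quadratic H (standing in for, e.g., the Huber penalty used by Wang et al.); sizes, architectures, and optimizer settings are illustrative assumptions:

```python
import torch

K, J = 3, 32 * 32
# Stand-ins for pre-trained generator branches F_k (frozen during evaluation)
branches = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(J, J), torch.nn.ReLU(), torch.nn.Linear(J, J))
     for _ in range(K)]
)
for p in branches.parameters():
    p.requires_grad_(False)

def synergistic_penalty(mu, n_iter=50, lr=1e-2):
    """mu: (K, J) tensor of channel images. Returns the minimum over z of the
    fit to the branches plus a quadratic H(z) (an assumed choice of H)."""
    z = torch.zeros(J, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(n_iter):
        opt.zero_grad()
        fit = sum(0.5 * ((branches[k](z) - mu[k]) ** 2).sum() for k in range(K))
        loss = fit + 1e-3 * (z ** 2).sum()   # H(z): simple quadratic stand-in
        loss.backward()
        opt.step()
    return loss.detach()

penalty_value = synergistic_penalty(torch.rand(K, J))
```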
The training of the generative model can also be unsupervised, for example as a multichannel auto-encoder (AE), i.e.,

$$\left( \theta^\star, \phi^\star \right) \in \arg\min_{\theta, \phi} \; \mathbb{E}\left[ \sum_{k=1}^{K} \left\| \boldsymbol{\mathcal{F}}_{\theta_k}\left( \boldsymbol{E}_\phi\left( \boldsymbol{\mu}^{\mathrm{tr}} \right) \right) - \boldsymbol{\mu}_k^{\mathrm{tr}} \right\|_2^2 \right],$$

where $\boldsymbol{E}_\phi : \mathbb{R}^{JK} \to \mathcal{Z}$ is a multichannel encoder, i.e., one that encodes a collection of images into a single latent vector, parametrized with $\phi$. In this approach, $\boldsymbol{\mu}_k$ is encouraged not to deviate from the "manifold" of plausible images $\{\boldsymbol{\mathcal{F}}_{\theta_k^\star}(\boldsymbol{z}), \boldsymbol{z} \in \mathcal{Z}\}$. Pinton et al. [100] and Gautier et al. [101] used this approach, respectively for PET/CT and PET/MRI, using a multi-branch variational AE, and reported considerable noise reduction by reconstructing the images synergistically as opposed to reconstructing the images individually. A patch-based version of this penalty with a K-Sparse AE (i.e., with $H = \|\cdot\|_0$) was proposed by Wu et al. [102] for single-channel CT. Duff et al. [103] proposed a similar approach with a Wasserstein generative adversarial network (W-GAN).
An alternative approach, namely the deep image prior introduced by Ulyanov et al. [104], consists of fixing the input $\boldsymbol{z}$ and optimizing the data fidelity with respect to $\theta$, in such a way that the reconstruction does not require pre-training of the NN. A multichannel version of this approach, using a multi-branch NN with a single input $\boldsymbol{z}$, was proposed for DECT [105].
Although deep-learned penalties have been successfully applied in image reconstruction, their application to spectral CT has been relatively limited and remains an active area of research. Future work should focus on developing more efficient and accurate deep-learned penalties that are specifically tailored to the unique challenges and opportunities of spectral CT.

E. Deep Learning-based Reconstruction
Another paradigm shift has been the development of end-to-end learning architectures that directly map the raw projection data to the reconstructed images. This approach, known as learned reconstruction, has two main categories: direct reconstruction and unrolling techniques. Direct reconstruction involves training a single NN to perform the reconstruction task, while unrolling techniques mimic an iterative algorithm by "unrolling" its iterations into layers. These techniques have shown great potential in image reconstruction, where the acquisition of data at different energy levels provides additional information about the material composition of the imaged object. In this section, we review recent advances in unrolling-based architectures for image reconstruction and their extension to synergistic spectral CT reconstruction. Direct methods have not yet been deployed for spectral CT and will be discussed in Section VI.
In the following, $(\boldsymbol{\mu}^{\mathrm{tr}}, \boldsymbol{y}^{\mathrm{tr}}) \in \mathbb{R}^{JK} \times \mathbb{R}^{IK}$ denotes a random spectral CT image/binned projections pair whose joint distribution corresponds to the empirical distribution derived from $L$ training pairs $(\boldsymbol{\mu}^{\mathrm{tr},[1]}, \boldsymbol{y}^{\mathrm{tr},[1]}), \ldots, (\boldsymbol{\mu}^{\mathrm{tr},[L]}, \boldsymbol{y}^{\mathrm{tr},[L]}) \in \mathbb{R}^{JK} \times \mathbb{R}^{IK}$, such that for all $\ell = 1, \ldots, L$ the spectral CT multichannel image $\boldsymbol{\mu}^{\mathrm{tr},[\ell]}$ is reconstructed from $\boldsymbol{y}^{\mathrm{tr},[\ell]}$.
Unrolling techniques, or learned iterative schemes, have become increasingly popular for image reconstruction in recent years, due to their ability to leverage the flexibility and scalability of deep neural networks while retaining the interpretability and adaptability of classical iterative methods. Unrolling-based techniques aim at finding a deep architecture that approximates an iterative algorithm.
For all energy bins $k$, the $(q+1)$-th iteration of an algorithm to reconstruct the image $\boldsymbol{\mu}_k$ can be written as

$$\boldsymbol{\mu}_k^{(q+1)} = \boldsymbol{\mathcal{L}}^k_{\theta_{q,k}}\left( \boldsymbol{\mu}_k^{(q)} \right), \quad (29)$$

where $\boldsymbol{\mathcal{L}}^k_{\theta_{q,k}}$ is an image-to-image mapping that intrinsically depends on $\boldsymbol{y}_k$ and that updates the image at layer $q$ to layer $q+1$. The parameter $\theta_{q,k}$ typically comprises algorithm hyperparameters such as step lengths and penalty weights, but also NN weights. For example, Eq. (8) and Eq. (9) are unrolled with

$$\boldsymbol{\mathcal{L}}^k_{\theta_{q,k}}\left( \boldsymbol{\mu}_k^{(q)} \right) = \boldsymbol{\Gamma}_{\theta_{q,k}}\left( \boldsymbol{\mu}_k^{(q)} - \boldsymbol{H}_k^{-1} g_k\left( \boldsymbol{\mu}_k^{(q)} \right) \right), \quad (30)$$

where $\boldsymbol{\Gamma}_{\theta_{q,k}}$ is a trainable mapping playing the role of the smoothing step (9), $\boldsymbol{\mu}_k^{(0)}$ is a given initial image, and the right-hand side depends on $\boldsymbol{y}_k$ by means of $\boldsymbol{\mathcal{L}}^k_{\theta_{q,k}}$. The trained parameter $\theta_k^\star$ is obtained by supervised training as

$$\theta_k^\star \in \arg\min_{\theta_k} \; \mathbb{E}\left[ \left\| \boldsymbol{\mathcal{L}}^k_{\theta_{Q-1,k}} \circ \cdots \circ \boldsymbol{\mathcal{L}}^k_{\theta_{0,k}}\left( \boldsymbol{\mu}_k^{(0)} \right) - \boldsymbol{\mu}_k^{\mathrm{tr}} \right\|_2^2 \right]. \quad (31)$$

Alternatives to Eq. (29) and Eq. (30), for example incorporating memory from previous iterates at each layer, can be found in Arridge et al. [106]. By utilizing components of iterative algorithms such as the backprojector $\boldsymbol{A}_k^\top$, unrolling-based architectures can map projection data to images without suffering from scaling issues. Many works in the literature derive unrolling architectures from existing model-based algorithms, and we will only cite a non-exhaustive list; we refer the reader to Monga et al. [107] for a review of unrolling techniques up to 2021. One of the first unrolling architectures, namely ADMM-Net, was proposed by Yang et al. [108] for CS MRI and consists in a modified ADMM algorithm [52] where basic operations (finite-difference operator, soft-thresholding, etc.) are replaced by transformations such as convolution layers with parameters that are trained end-to-end. Other works rapidly followed for regularized inverse problems in general and image reconstruction in particular. Learned proximal operators consist of replacing the update (9) with a trainable convolutional neural network (CNN) [109], [110]. In a similar fashion, Chun et al. proposed BCD-Net [111] and its accelerated version Momentum-Net [112], which consist in unrolling a variable-splitting algorithm and replacing the image regularization step with a CNN. Adler et al. [113] proposed a trainable unrolled version of the primal-dual (Chambolle-Pock) algorithm [53].
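A minimal PyTorch sketch of one unrolled layer in the spirit of Eq. (30): a gradient step on a linearized Gaussian data fit followed by a small trainable CNN playing the role of the smoothing step; the architecture and the linearized model are illustrative assumptions, not a published design:

```python
import torch

class UnrolledLayer(torch.nn.Module):
    """One layer L^k_{theta_{q,k}}: data-fit gradient step + learned smoothing."""
    def __init__(self):
        super().__init__()
        self.step = torch.nn.Parameter(torch.tensor(0.1))  # trainable step length
        self.cnn = torch.nn.Sequential(
            torch.nn.Conv2d(1, 16, 3, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, mu, A, b):
        # mu: (1, 1, H, W) image; A: (I, H*W) system matrix; b: (I,) sinogram
        flat = mu.flatten()
        grad = A.T @ (A @ flat - b)              # gradient of 0.5 ||A mu - b||^2
        mu = mu - self.step * grad.view_as(mu)   # reconstruction step, cf. (8)
        return mu + self.cnn(mu)                 # learned residual smoothing, cf. (9)
```

Stacking Q such layers and training the stack end-to-end against reference images corresponds to the supervised objective (31).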
A synergistic reconstruction algorithm such as that given by Eq. (14) and Eq. (15) may also be unrolled into a trainable deep multi-branch architecture by merging the mappings $\boldsymbol{\mathcal{L}}^k_{\theta_{q,k}}$ at each layer $q$ into a single multichannel mapping $\boldsymbol{\mathcal{L}}_{\Theta_q} : \mathbb{R}^{JK} \to \mathbb{R}^{JK}$ that depends on the entire binned projection dataset $\boldsymbol{y} = \{\boldsymbol{y}_k\}$ and on some parameter $\Theta_q$. The update from layer $q$ to layer $q+1$ is given by

$$\boldsymbol{\mu}^{(q+1)} = \boldsymbol{\mathcal{L}}_{\Theta_q}\left( \boldsymbol{\mu}^{(q)} \right), \quad (32)$$

where the mapping $\boldsymbol{\mathcal{L}}_{\Theta_q}$ utilizes the entire data and updates the images simultaneously, thus allowing the information to pass between channels. For example, the layer corresponding to Eq. (14) and Eq. (15) is

$$\boldsymbol{\mathcal{L}}_{\Theta_q}\left( \boldsymbol{\mu}^{(q)} \right) = \boldsymbol{\Gamma}_{\Theta_q}\left( \left\{ \boldsymbol{\mu}_k^{(q)} - \boldsymbol{H}_k^{-1} g_k\left( \boldsymbol{\mu}_k^{(q)} \right) \right\}_{k=1}^{K} \right), \quad (33)$$

for some initialization $\boldsymbol{\mu}^{(0)}$, where $\boldsymbol{\Gamma}_{\Theta_q}$ is a trainable multichannel mapping playing the role of the synergistic smoothing step (15). The trained parameter $\Theta^\star = \{\Theta_q^\star\}$ is obtained by supervised training similar to Eq. (31), but using the data at all energy bins simultaneously:

$$\Theta^\star \in \arg\min_{\Theta} \; \mathbb{E}\left[ \left\| \boldsymbol{\mathcal{L}}_{\Theta_{Q-1}} \circ \cdots \circ \boldsymbol{\mathcal{L}}_{\Theta_0}\left( \boldsymbol{\mu}^{(0)} \right) - \boldsymbol{\mu}^{\mathrm{tr}} \right\|_2^2 \right]. \quad (34)$$

A simplified representation of this architecture is shown in Fig. 2.
At the time of writing, very little research has addressed synergistic reconstruction using unrolling-based architectures. We can cite the recent work SOUL-Net by Chen et al. [114], which proposes an ADMM-based architecture to solve the joint problem (13) with the nuclear norm (for the LR penalty, cf. Section III-C) and TV. Chen et al. modified the singular value thresholding step for nuclear norm minimization by adding a ReLU function with trainable parameters, and replaced the TV minimization with a CNN combined with an attention-based network. They showed that their method outperforms "conventional" LR + sparse decomposition methods.
Unrolling techniques have shown great promise as a flexible and powerful tool for single-channel image reconstruction. Although these techniques have been applied successfully to a variety of imaging modalities, their application to multichannel synergistic reconstruction in spectral CT remains relatively limited and challenging, due to the high-dimensional nature of the data and the need for accurate modeling of the spectral correlations. However, unrolling techniques have been proposed for projection-based and one-step material decomposition; see Section IV.

IV. MATERIAL DECOMPOSITION
Spectral CT techniques such as DECT and PCCT are often used to characterize the materials of the scanned patient or object by decomposing the linear attenuation coefficient into material images. This process of material decomposition is based on the assumption that the energy dependence of the linear attenuation coefficient in each pixel can be expressed as a linear combination of a small number $M$ of basis functions [115]. The linear attenuation $\mu(r, E)$ can then be modeled as

$$\mu(r, E) = \sum_{m=1}^{M} f_m(E)\, x_m(r), \quad (35)$$

where $f_m$ represents the $m$-th energy-dependent basis function and $x_m$ is the $m$-th material image. These basis functions describe physical effects such as photoelectric absorption and Compton scattering [115], or the linear attenuation coefficients of representative materials of the scanned object, such as water and bone for patients. With this model, two basis functions are sufficient to describe the variations of the linear attenuation coefficients of human tissues with energy [116]-[118]. One or more basis functions may also be used to represent a specific contrast agent, e.g., a material with a K-edge discontinuity in its attenuation coefficient in the range of diagnostic energies (30-140 keV) [119]. The material images $x_m$ can be represented in the discrete domain as a vector $\boldsymbol{x}_{:,m} = [x_{1,m}, \ldots, x_{J,m}]^\top$ using the pixel basis functions $u_j(r)$ (see Eq. (3)),

$$x_m(r) = \sum_{j=1}^{J} x_{j,m}\, u_j(r), \quad (36)$$

where $x_{j,m}$ is the weight of the $m$-th basis function in the $j$-th pixel, i.e., each pixel of the unknown image is decomposed into the chosen material basis. Injecting (36) into (2) links the material decomposition to the expected value (e.g., the number of detected X-ray photons for PCCT),

$$\bar{y}_{i,k}(\boldsymbol{x}) = \int h_{i,k}(E) \exp\left( -\sum_{m=1}^{M} f_m(E) \left[ \boldsymbol{A}_k \boldsymbol{x}_{:,m} \right]_i \right) \mathrm{d}E + r_{i,k}. \quad (37)$$

This problem is the combination of two sub-problems: tomographic reconstruction and spectral unmixing. The two problems can be solved sequentially or jointly, and most techniques in the literature fall into one of the following categories: image-based, projection-based, or one-step material decomposition.

A. Image-based Material Decomposition
Image-based algorithms decompose the multichannel CT image $\boldsymbol{\mu} = \{\boldsymbol{\mu}_k\}$ into material images $\boldsymbol{x}_m$. While each channel $\boldsymbol{\mu}_k$ is often obtained by direct methods such as FBP, an alternative procedure is the reconstruction of each channel $\boldsymbol{\mu}_k$ from $\boldsymbol{y}_k$ by solving the MBIR problem in Eq. (7), or the joint reconstruction of $\boldsymbol{\mu} = \{\boldsymbol{\mu}_k\}$ from $\boldsymbol{y} = \{\boldsymbol{y}_k\}$ by solving the synergistic MBIR problem in Eq. (13). The discretized version of Eq. (36) in the image domain is

$$\mu_{j,k} = \sum_{m=1}^{M} F_{k,m}\, x_{j,m}, \quad (38)$$

with $F_{k,m} \simeq f_m(E_k)$ and $E_k$ the energy of the attenuation image $\boldsymbol{\mu}_k$. The images may be decomposed by solving in each pixel the linear inverse problem

$$\begin{bmatrix} \mu_{j,1} \\ \vdots \\ \mu_{j,K} \end{bmatrix} = \boldsymbol{F} \begin{bmatrix} x_{j,1} \\ \vdots \\ x_{j,M} \end{bmatrix}, \quad (39)$$

where $\boldsymbol{F} = \{F_{k,m}\} \in \mathbb{R}^{K \times M}$ is the same matrix for all voxels, characterizing the image-based decomposition problem. It is generally calibrated with spectral CT images of objects of known attenuation coefficients. Given that $K$ and $M$ are small, the pseudo-inverse of $\boldsymbol{F}$ can be easily computed and applied quickly after the tomographic reconstruction of $\boldsymbol{\mu}$.

Image-based material decomposition faces two challenges: (1) the spectral CT images are affected by higher noise than conventional CT (if the same total dose is split across energy bins), which is amplified by the poor conditioning of $\boldsymbol{F}$, and (2) the spectral CT images suffer from beam-hardening artifacts since the efficient spectra $h_{i,k}$ are not truly monochromatic in most cases, i.e., $\boldsymbol{F}$ is actually voxel- and object-dependent.

Machine learning algorithms have been used for image-based decomposition to mitigate noise and beam-hardening artifacts. Some techniques learn an adequate regularization [120]-[125] while using the linear model in Eq. (39). These techniques are similar in essence to those described in Section III-D1, except that dictionary learning operates on the decomposed images to spatially regularize them.
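Because F is shared by all pixels, the whole decomposition reduces to one small pseudo-inverse applied across the image, as in this minimal sketch (the coefficients in F are placeholders, not calibrated values):

```python
import numpy as np

def image_based_decomposition(mu, F):
    """Solve Eq. (39) for every pixel at once.

    mu: (K, J) array, one reconstructed image per energy bin.
    F:  (K, M) decomposition matrix, identical for all voxels.
    Returns the material images as an (M, J) array.
    """
    F_pinv = np.linalg.pinv(F)   # (M, K), cheap since K and M are small
    return F_pinv @ mu

# Illustrative 2-bin, 2-material (e.g., water/bone-like) example
F = np.array([[0.25, 0.45],
              [0.20, 0.30]])                 # F[k, m] ~ f_m(E_k), placeholder values
mu = np.random.rand(2, 512 * 512)            # two reconstructed energy-bin images
x = image_based_decomposition(mu, F)         # (2, 512*512) material images
```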
NNs may be used instead to improve upon the linear model in Eq. (39) [126]. As in many other fields of research on image processing, deep CNNs have demonstrated their ability to solve image-based decomposition with a more satisfactory solution than that produced by a pixel-by-pixel approach. Several deep learning architectures, previously designed for other image processing tasks, have been deployed for image-based decomposition. Most works are based on a supervised learning approach where a dataset of manually segmented basis material images is available: fully convolutional networks [127], U-Net [128]-[133], Butterfly-Net [134], visual geometry group networks [132], [135], Incept-net [136], [137], generative adversarial networks (GANs) [138], and Dense-net [139]. These contributions differ in the type of architecture adopted and in the complexity of the network, as measured by the number of trainable parameters. They also differ in which inputs are used by the network, e.g., reconstructed multichannel CT images $\boldsymbol{\mu}$ [133] or pre-decomposed CT images [131]. The network output is generally the decomposed CT images $\boldsymbol{x}_m$, but it may also be other images, e.g., the elemental composition [132], quantities used for radiotherapy planning such as the electron density image [140], or the virtual non-calcium image [137].

B. Projection-based Material Decomposition
The main limitation of image-based approaches is that the input multichannel CT image $\boldsymbol{\mu}$ is generally flawed by beam hardening. If several energy measurements are available for the same ray ($\boldsymbol{A}_k = \boldsymbol{A}$ for all $k$), as with a dual-layer DECT or a PCCT, an alternative approach is projection-based decomposition [115], [119], which aims at estimating the projections

$$a_{i,m} = \left[ \boldsymbol{A} \boldsymbol{x}_{:,m} \right]_i, \quad i = 1, \ldots, I, \; m = 1, \ldots, M, \quad (40)$$

of the decomposed CT images $\boldsymbol{x}_m$ from the measurements $\boldsymbol{y}_k$, given the forward model

$$\bar{y}_{i,k}(\boldsymbol{a}_{i,:}) = \int h_{i,k}(E) \exp\left( -\sum_{m=1}^{M} f_m(E)\, a_{i,m} \right) \mathrm{d}E + r_{i,k}, \quad (41)$$

where $\boldsymbol{a}_{i,:} = [a_{i,1}, \ldots, a_{i,M}]^\top$ and $\boldsymbol{a}_{:,m} = [a_{1,m}, \ldots, a_{I,m}]^\top$. In this context, the expected value $\bar{\boldsymbol{y}}_k$ becomes a function of $\boldsymbol{a} = \{\boldsymbol{a}_{i,:}\}$ (or $\{\boldsymbol{a}_{:,m}\}$) instead of $\boldsymbol{x}$. Given the decomposed projections $\boldsymbol{a}_{:,m}$, the images $\boldsymbol{x}_m$ are obtained by solving the following inverse problem

$$\boldsymbol{A} \boldsymbol{x}_{:,m} = \boldsymbol{a}_{:,m}, \quad m = 1, \ldots, M, \quad (42)$$

where multichannel reconstruction algorithms, e.g., those described in Sections III-B and III-C, can be deployed to reconstruct $\boldsymbol{x}$ from $\boldsymbol{a}$.

Similar to image-based decomposition, projection-based decomposition can be solved pixel by pixel in the projection domain by solving

$$\hat{\boldsymbol{a}}_{i,:} \in \arg\min_{\boldsymbol{a}_{i,:} \in \mathbb{R}_+^M} \; L\left( \boldsymbol{y}, \bar{\boldsymbol{y}}(\boldsymbol{a}_{i,:}) \right) + \beta R(\boldsymbol{a}_{i,:}). \quad (43)$$

The number of inputs and unknowns per projection pixel is the same as in image-based decomposition, but the problem is more complex because the exponential in Eq. (41) induces a non-linear relationship between $y_{i,k}$ and $\boldsymbol{a}_{i,:}$. Moreover, this inverse problem (43) is non-convex [141] (unless, obviously, the exponential is linearized), and fully-connected NNs have been used to solve it [142], [143]. Such networks can also be used to correct the input data for spectral distortions before material decomposition [144], or to modify the model described by Eq. (41) to account for pixel-to-pixel variations [145] or pulse pile-up [146]. However, these approaches cannot reduce noise compared to the conventional estimation of most likely solutions [119] without accounting for spatial variations. The idea of spatially regularizing pixel-based material decomposition was first investigated with variational approaches [147], [148], which solve the joint problem

$$\hat{\boldsymbol{a}} \in \arg\min_{\boldsymbol{a}} \; L\left( \boldsymbol{y}, \bar{\boldsymbol{y}}(\boldsymbol{a}) \right) + \beta R(\boldsymbol{a}). \quad (44)$$

As in image-based algorithms, DL [149], [150] has been investigated to improve the spatial regularization, as well as CNNs to learn features of the projections, with U-Net [129], [130], ResUnet [151], stacked auto-encoders [152], perceptrons [153], GANs [154], and ensemble learning [155], [156].
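A minimal sketch of the per-ray non-linear fit behind Eq. (43), using a Gaussian (least-squares) data fit; for brevity it omits the nonnegativity constraint and the regularizer R, and all inputs are illustrative:

```python
import numpy as np
from scipy.optimize import least_squares

def decompose_ray(y_i, h_i, f, r_i=0.0):
    """Fit a_i = (a_{i,1}, ..., a_{i,M}) for one ray under the model (41).

    y_i: (K,) measured counts per bin; h_i: (K, E) binned fluxes on an
    energy grid; f: (M, E) basis functions on the same grid. Assumes
    K >= M (required by the Levenberg-Marquardt solver used here).
    """
    def residuals(a):
        atten = np.exp(-(f.T @ a))     # exp(-sum_m f_m(E) a_m), shape (E,)
        y_model = h_i @ atten + r_i    # expected counts per bin, shape (K,)
        return y_model - y_i
    sol = least_squares(residuals, x0=np.zeros(f.shape[0]), method="lm")
    return sol.x
```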
A promising alternative to these supervised techniques, which learn the physical model from the data, is to solve (44) by combining iterative reconstruction with learning algorithms in so-called learned gradient descent schemes, using the unrolling algorithms [157] detailed in Section III-E. Other approaches, such as that proposed by Zhang et al. [158], combine multiple NNs, learning the material decomposition in the projection domain and adding a refinement network in the image domain to enhance the reconstructed image quality.

C. One-step Material Decomposition
One limitation of projection-based decomposition is that some statistical information is lost in the decomposed projections $\boldsymbol{a}$, which could be useful to reconstruct the most likely material maps $\boldsymbol{x}$. The noise correlations between the decomposed sinograms $\boldsymbol{a}$ may be accounted for in the subsequent tomographic reconstruction [159], [160], but this cannot fully characterize the noise of the measurements $\boldsymbol{y}$, in particular with more than two energy bins ($K > 2$). Several groups have investigated an alternative solution combining material decomposition and tomographic reconstruction in a one-step algorithm, which reconstructs the material maps $\boldsymbol{x}$ from the measurements $\boldsymbol{y}$ by solving the optimization problem

$$\hat{\boldsymbol{x}} \in \arg\min_{\boldsymbol{x}} \; L\left( \boldsymbol{y}, \bar{\boldsymbol{y}}(\boldsymbol{x}) \right) + \beta R(\boldsymbol{x}). \quad (45)$$

Compared to Eq. (7), solving (45) is a far more difficult problem, similar to projection-based algorithms but with a larger number of unknowns ($J \times M$) and inputs ($I \times K$). Several iterative solutions have been proposed to address this problem by optimizing the most likely material maps $\boldsymbol{x}$ given the measurements $\boldsymbol{y}$ with spatial regularization. One of the main differences between these algorithms is the optimization algorithm, from non-linear conjugate gradient [161] to SQS algorithms [162]-[164] and primal-dual algorithms [165], [166]. The nature of this problem is such that all algorithms based on machine learning have used part of the physical model in their architecture. Generally, combining physics knowledge and deep learning for material decomposition is implemented through unrolling methods [167] (Section III-E). Eguizabal et al. [168] adapted the projection-based unrolling algorithm of [157] to one-step reconstruction. The same group has used machine learning to improve the physical model in Eq. (37) by modeling charge sharing [169]. Another approach is to insert a backprojection step into the network architecture, i.e., the adjoint of the line integral operator in Eq. (37), to account for this knowledge in the network architecture [170], [171]. Finally, machine learning may be used at each iteration for denoising the images, e.g., with a dictionary approach [172]. A self-supervised approach named Noise2Noise prior [173], which does not require manually segmented ground-truth material images, has been applied to one-step decomposition using a training dataset consisting of sinograms paired with their noisy counterparts obtained by sinogram splitting.
The different approaches to material decomposition differ on many levels, from computational cost to the accuracy of the decomposed images. For example, Abascal et al. [129] compared projection-based and image-based algorithms using variational approaches and machine learning. They observed the best image quality with an image-based material decomposition approach, as illustrated in Fig. 3. However, the recent Grand Challenge on Deep-Learning spectral Computed Tomography [174] demonstrated that many different approaches are still under investigation. Nine of the ten best scorers used machine learning, and most combined it with a model of the DECT acquisition. The deployment of such algorithms in clinical scanners will depend on both their practicality, e.g., the computational time, and the accuracy of the material decomposition on real data.

V. DATA PRE-PROCESSING AND IMAGE POST-PROCESSING
CT technology has been the front-line imaging tool in emergency rooms due to its fast, non-invasive, and high-resolution nature, with millions of scans performed annually worldwide. However, because of the increased cancer risk associated with radiation exposure, "as low as reasonably achievable" is the central principle to follow in radiology practice. Recent advances in CT technology and deep learning techniques have led to great progress in reducing radiation doses in CT scans [175]. For example, aided by deep learning techniques, much progress has been made in low-dose or few-view CT reconstruction without sacrificing significant image quality. Furthermore, the use of DECT technology allows further cuts in radiation dose by replacing previously required non-contrast CT scans with virtual unenhanced images in clinical practice [176].
While many of the prior-regularized iterative reconstruction techniques described in Section III inherently suppress noise and artifacts, network-based post-processing techniques are also popular for removing noise and artifacts from already reconstructed low-dose spectral images, and are covered here. Moreover, PCCT with PCDs is widely viewed as a comprehensive upgrade to DECT, since it produces less noise, better spectral separation, and higher spatial resolution while requiring a lower radiation dose [29], [30]. However, PCDs often exhibit increased nonuniformity and spectral distortion due to charge-sharing and pulse pile-up effects compared to traditional energy-integrating detectors (EIDs), and the correction of these imperfections in PCD images is included here. Finally, we also review deep learning techniques that enhance clinical diagnosis with spectral CT, including virtual monoenergetic image synthesis, virtual noncontrast image generation, iodine dose reduction, virtual calcium suppression, and other applications. An overview of this section is summarized in Fig. 4.

A. PCCT Data Pre-processing
PCDs offer a much smaller pixel size than EIDs and also possess an energy discrimination ability that can greatly enhance CT imaging with significantly higher spatial and spectral resolution. However, PCD measurements are often distorted by undesired charge sharing and pulse pileup effects, which can limit the accuracy of attenuation values and material decomposition. Since accurately modeling these effects is highly complex, deep learning methods are being actively explored for distortion correction in a data-driven manner. An initial attempt was introduced by Touch et al. [144], where a simple fully-connected NN with two hidden layers of five neurons each was adopted, mainly for charge sharing correction. Later, the same network structure but with more neurons was used by Feng et al. [177] to compensate for pulse pileup distortion, and similarly in [178], [179] for spectral distortion correction. A large CNN model was first introduced by Li et al. [180] to leverage inter-pixel information for the correction of both charge sharing and pulse pileup effects. The model included a dedicated generator with a pixel-wise fully-connected sub-network for the intra-pixel distortion caused by pulse pileup and a convolutional sub-network for inter-pixel cross-talk correction, and was trained using the W-GAN framework for spectral correction. More recently, Holbrook et al. [181] used multi-energy CT scans with an EID to calibrate the PCD spectral distortion, and adopted a U-Net to map the distorted PCD projections onto monochromatic projections generated from the multi-energy CT scans after material decomposition. Ma et al. [182] introduced a CNN-LSTM to correct pulse pileup distortion in X-ray source spectrum measurements, while Smith et al. [183] used a spatial-temporal CNN for charge sharing compensation.
There are also several interesting studies on artifact correction for PCCT using deep learning methods. Erath et al. used a deep network with a perceptual loss for the correction of ring artifacts caused by pixel nonuniformity [186], while Fang et al. [187] used two U-Nets, in the projection domain and the image domain respectively, for ring artifact removal.

B. Image Post-processing

1) Image Denoising:
In CT imaging, it is important to limit the radiation dose to patients, but reducing the dose often gives rise to image noise, which can strain radiologists' interpretation. To address this issue, various image denoising methods have been developed that aim to recover a clean version $\boldsymbol{\mu}^\star$ from a noisy image $\boldsymbol{\mu}_0$ by leveraging prior knowledge $R$ of the image to maintain sufficient image quality for clinical evaluation,

$$\boldsymbol{\mu}^\star \in \arg\min_{\boldsymbol{\mu}} \; \frac{1}{2} \left\| \boldsymbol{\mu} - \boldsymbol{\mu}_0 \right\|_2^2 + \beta R(\boldsymbol{\mu}). \quad (46)$$

The development of CT noise reduction techniques has a long history, with its roots dating back to shortly after the invention of CT. While our focus is on deep learning and spectral CT, it is important to briefly cover classic post-processing denoising techniques and deep learning techniques for single-energy CT, as they can still be applied to spectral CT in a channel-by-channel manner. We will then dive into the recent trend of self-supervised deep denoising methods, as well as deep methods that incorporate the correlations between energy channels.

Spatial filtering methods leverage the statistical nature of noise fluctuations and operate through local or non-local averaging [188]-[190]; optimization-based denoising methods, on the other hand, incorporate image model assumptions such as domain sparsity, piecewise linearity, or gradient smoothness as regularization. Well-known methods in this category include TV [191], DL [72], [192], wavelet-based denoising [193], block-matching and 3-D filtering (BM3D) [194], and others. A good discussion of these classic denoising techniques is provided by Diwakar et al. in their review paper [195]. Different from the explicitly defined prior knowledge in traditional methods, the development of deep learning techniques, particularly CNNs, provides a data-driven approach to learn implicit distribution knowledge from large amounts of images, offering a one-step solution to the denoising problem (Eq. (46)), i.e.,

$$\boldsymbol{\mu}^\star = \boldsymbol{F}_{\theta^\star}(\boldsymbol{\mu}_0), \quad (47)$$

where $\boldsymbol{F}_{\theta^\star}$ denotes the network function with optimized parameters $\theta^\star$ after training. Being far more powerful than the traditional methods, deep methods soon came to dominate the research field of CT image denoising. Initially, these methods were primarily trained in a supervised fashion using paired noisy and clean images, as generally depicted by

$$\theta^\star \in \arg\min_{\theta} \; \mathbb{E}\left[ \left\| \boldsymbol{F}_\theta\left( \boldsymbol{\mu}_0^{\mathrm{tr}} \right) - \boldsymbol{\mu}^{\mathrm{tr}} \right\|_2^2 \right], \quad (48)$$

and successful examples include REDCNN [196], the wavelet network [197], and the stacked competitive network [198].
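A minimal PyTorch sketch of the supervised training in Eq. (48), using a small residual CNN as F_theta; the architecture, the residual-learning choice, and all hyperparameters are placeholders rather than a specific published network such as REDCNN:

```python
import torch

# Small CNN standing in for F_theta; sizes are illustrative
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 32, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(32, 32, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(32, 1, 3, padding=1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(noisy, clean):
    """One supervised update on a paired batch.

    noisy, clean: (N, 1, H, W) tensors. The network predicts the residual,
    a common design choice assumed here, not mandated by Eq. (48).
    """
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(noisy) + noisy, clean)
    loss.backward()
    opt.step()
    return loss.item()
```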
The issue of missing paired labels was soon recognized when researchers attempted to apply supervised methods in practice. To address this, a number of unsupervised or self-supervised methods have been proposed. For instance, cycle-GAN-based techniques are able to utilize unpaired data for training by promoting cycle consistency between domains [205], [206], [222], [225]. However, these GAN-based methods have been criticized for potentially generating erroneous structures. The Poisson Unbiased Risk Estimator (PURE) and the Weighted Stein's Unbiased Risk Estimator (WSURE) are alternatives that convert the supervised MSE loss into a form that relies only on the noisy input, the network output, and its divergence [226]. This approach forms an unsupervised training framework in which the divergence term is approximated using the Monte-Carlo perturbation method [227]. Noise2Noise is another method that enables training the network with paired noisy-noisy images, which is statistically equivalent to training with the original noisy-clean pairs in Eq. (48), where $\mu^0$ and $\mu^1$ are now different noisy realizations of the same image, e.g., two independent CT scans of the same object.
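The Noise2Noise idea amounts to a one-line change to the supervised loss: the clean label is replaced by a second noisy realization. The sketch below illustrates this with a placeholder denoiser; the single-convolution "network" and synthetic scans are assumptions for illustration.

```python
# Minimal sketch of the Noise2Noise target swap: the "label" is a second
# noisy realization of the same object, so no clean image is needed.
import torch
import torch.nn as nn

net = nn.Conv2d(1, 1, 3, padding=1)  # placeholder for any denoising network
opt = torch.optim.Adam(net.parameters(), lr=1e-4)

image = torch.rand(8, 1, 64, 64)              # unknown clean image
mu0 = image + 0.05 * torch.randn_like(image)  # first noisy scan
mu1 = image + 0.05 * torch.randn_like(image)  # second, independent scan

opt.zero_grad()
loss = nn.functional.mse_loss(net(mu0), mu1)  # Eq. (48) with a noisy label
loss.backward()
opt.step()
```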
Building on this idea, several recent variants have been developed for self-supervised low-dose CT denoising by generating noisy pairs via various approaches [228]-[236].
For instance, Noise2Inverse partitions the projection data into several subsets and enforces consistency between the corresponding reconstructions [234], while Noise2Context promotes similarity between adjacent CT slices in 3-D thin-layer CT [232]; Half2Half adopts the thinning technique [237] to split a full-dose real CT scan into two pseudo half-dose scans [230]. Spectral CT powerfully extends conventional single-energy CT by introducing an extra energy dimension. However, splitting the photons into different energy bins increases the noise level of the projection at each bin compared to conventional CT at the same overall radiation dose. Therefore, to achieve optimal denoising performance for spectral CT, it is necessary to leverage inter-bin information, similar to the approach taken in learned synergistic reconstruction (Section III-E), as described below:

$\theta^\star = \arg\min_{\theta} \sum_i \mathcal{L}\big(F_\theta(\{\mu_{i,k}^0\}), \{\mu_{i,k}^1\}\big)$,

where $\{\mu_{i,k}\}$ collects the images of the $i$-th sample over all energy bins $k = 1, \ldots, K$. Several recent papers have explored this direction. ULTRA [238] incorporates an $\ell_p$-norm and an anisotropic total variation loss to train a residual U-Net with multichannel inputs from PCCT scans. Noise2Sim [235] constructs noisy pairs using the Noise2Noise principle, replacing each pixel of the original noisy image with one of its k-nearest pixels searched over the spatial dimensions (including adjacent slices) and measured by non-local means. The multichannel image is fed to the network as a whole, and noisy pairs are constructed independently for different bins to fully leverage the self-similarities within spectral CT scans. By this means, comparable or even better performance has been demonstrated on experimental PCCT scans relative to supervised learning methods. S2MS [231] proposes another interesting approach that leverages inter-channel information by converting the linear attenuation map of each channel into a channel-independent density map, which yields different noisy realizations of the density image from multiple channels. Promising results of this self-supervised learning idea have been demonstrated in a simulation study.
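A simple way to exploit inter-bin information in practice is to stack the energy bins as network input channels so that convolutions can mix information across bins, as in the following sketch; the architecture and bin count are illustrative assumptions.

```python
# Minimal sketch of feeding all energy bins jointly to a denoiser so that
# convolutions can exploit inter-bin correlations; sizes are assumptions.
import torch
import torch.nn as nn

n_bins = 5
multichannel_net = nn.Sequential(
    nn.Conv2d(n_bins, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, n_bins, 3, padding=1),
)

# A spectral CT slice with n_bins energy channels (stand-in data).
spectral_slice = torch.rand(1, n_bins, 128, 128)
denoised = multichannel_net(spectral_slice)  # all bins denoised in one pass
print(denoised.shape)  # torch.Size([1, 5, 128, 128])
```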
Besides developing various deep denoising methods, researchers have also investigated the effects of noise reduction on downstream tasks [238], [239]. For example, Evans et al. [239] compared the material decomposition results of multi-bin PCCT images before and after denoising with BM3D and Noise2Sim through phantom studies. They found that image denoising improves the accuracy of material concentration quantification, but not material classification. In the clinical domain, there are several Food and Drug Administration (FDA)-approved deep denoising methods from multiple vendors (e.g., TrueFidelity from GE Healthcare, the Advanced Intelligent Clear-IQ Engine (AiCE) from Canon, PixelShine from Algomedica, ClariCT.AI from ClariPI Inc., etc.), and numerous studies have investigated their clinical impact. For ease of notation, we use deep learning image reconstruction (DLIR) to refer specifically to these FDA-approved methods in clinical applications. Noda et al. [240] showed that with DLIR, the radiation dose of whole-body CT can be reduced by up to 75% while maintaining similar image quality and lesion detection rates compared to standard-dose CT with iterative reconstruction, based on a study cohort of 59 patients. This conclusion is also supported by other studies in which DLIR and iterative reconstruction of the same patient scans are compared, showing that DLIR provides significantly preferred image quality and reduced noise [241], [242].
For diagnosis with DECT, DLIR significantly improves the diagnostic acceptability and conspicuity of pancreatic cancer, and reduces the variation in iodine concentration values while maintaining their accuracy [243]. Fukutomi et al. [244] reported similar results in terms of iodine concentration quantification through both phantom and clinical studies. The stability of iodine quantification accuracy with DLIR has also been investigated under radiation dose variation. For example, Kojima et al. [245] found that the accuracy is not affected by the radiation dose when the dose index is greater than 12.3 mGy. For a more detailed assessment of DLIR in clinical practice, a recent review paper by Szczykutowicz et al. [246] provides a good starting point. It is also worth noting that the aforementioned studies on PCCT [239] and DECT [244] lead to different conclusions about the impact of denoising on iodine/material concentration quantification, which could be attributed to the different energy discrimination mechanisms of PCCT and DECT, as the number of energy bins and the spectral separation can significantly influence the accuracy and stability of material decomposition [30].
2) Artifacts Correction: Besides noise, image artifacts are another factor that degrades CT image quality for diagnostic evaluation. Few-view or limited-angle acquisition is an effective way to reduce the radiation dose, but it introduces globally distributed artifacts that are difficult to remove. To be concise and avoid overlap with Section III, here we only cover recent progress on post-processing-based artifact reduction via deep learning for spectral CT. For this application, the networks are often trained in a supervised manner and directly applied to FBP reconstructions to remove artifacts, which can be similarly described by Eq. (48) and Eq. (47) with $\mu^0$ and $\mu^1$ being the few-view/limited-angle reconstruction and the full-view/full-angle reconstruction, respectively. For example, to reduce few-view reconstruction artifacts and accelerate reconstruction for scans at multiple energy points (i.e., 32 channels), Mustafa et al. [247] proposed a U-Net-based approach that maps few-view FBP reconstructions to computationally intensive full-view iterative reconstructions with TV regularization. The 32-channel FBP images are fed to the network simultaneously and transformed into high-quality 32-channel reconstructions in one step, greatly reducing the computational cost. More recently, Lee et al. [248] developed a multi-level wavelet convolutional neural network, using a U-Net architecture with the wavelet transform as the down-sampling/up-sampling operations, that effectively captures and removes globally distributed few-view artifacts. The network processes multichannel images simultaneously to leverage inter-channel information, and demonstrates promising results both numerically and experimentally with an edge-on silicon-strip PCD. To address limited-angle artifacts in cone-beam DECT, Zhang et al. [249] proposed TIME-Net, which utilizes a transformer module with global attention. In addition, the two complementary limited-angle scans at the two energies are fused to form a prior reconstruction; the features extracted from the prior, high-energy, and low-energy reconstructions are then fused in latent space to exploit inter-channel information within the network.
In dual-source DECT scanners, the high-energy imaging chain (i.e., tube B with a tin filter, typically at 140 kVp) often has a restricted field of view (FOV) (e.g., 33 cm) due to physical constraints, compared to the other chain (e.g., 50 cm for tube A), which can be problematic for larger patients and affect diagnosis. To outpaint the missing regions and match the size of the normal FOV, Liu et al. [250] proposed a self-supervised method that maps the low-energy image to the high-energy image with a loss function focusing only on image values within the restricted FOV. The outpainting is then completed automatically by leveraging the shift-invariant nature of CNNs. Similarly, Schwartz et al. [251] proposed an FOV extension method that feeds both the high-energy image and the low-energy image into the network, along with a high-energy estimate derived from the low-energy image via a piecewise-linear transfer function. The trained network was applied to patient data for renal lesion evaluation and showed reliable results in terms of HU values and lesion classification accuracy in the extended regions.
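The masked-loss idea behind such FOV outpainting can be sketched as follows; the image sizes, the circular mask construction, and the placeholder network output are assumptions for illustration.

```python
# Minimal sketch of a masked loss for FOV outpainting: the network maps the
# full-FOV low-energy image to the high-energy image, but the loss is only
# evaluated inside the restricted high-energy FOV. Shapes are assumptions.
import torch

h, w = 256, 256
low_energy = torch.rand(1, 1, h, w)    # full 50 cm FOV (stand-in)
high_energy = torch.rand(1, 1, h, w)   # valid only inside the 33 cm FOV

# Circular mask of the restricted high-energy FOV.
yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
r = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2).sqrt()
mask = (r < 0.33 / 0.50 * (w / 2)).float()[None, None]

prediction = low_energy  # placeholder for a trained CNN's output

# Loss restricted to the measured region; outside it the CNN extrapolates.
loss = ((prediction - high_energy) ** 2 * mask).sum() / mask.sum()
```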

C. Image Generation for Clinical Applications
With the recent development of DECT and PCCT techniques, spectral imaging is reshaping the clinical utilization of CT. These techniques enable the generation of multiple types of images that enhance diagnosis and improve disease management, such as virtual monochromatic images (VMIs), virtual unenhanced images, bone suppression images, and material decomposition maps. A good number of research studies have been performed in these areas using deep learning approaches.
1) Single-Energy to Dual-Energy Mapping: Despite the great possibilities offered by DECT and PCCT, their accessibility remains limited in comparison to conventional single-energy CT, largely due to the high cost involved. To bridge the gap, Zhao et al. [252] demonstrated the feasibility of using deep learning to predict high-energy CT images from given low-energy CT images in a retrospective study. Shortly after, Lyu et al. [253] proposed a material decomposition CNN capable of generating high-quality DECT images from a low-energy scan combined with a single-view high-energy projection, leveraging the anatomical consistency and energy-domain correlation between the two energy images in DECT. The feasibility of this method has been validated in patient studies, showing great potential for simplifying DECT hardware and reducing radiation exposure during DECT scans.
2) Virtual Monochromatic Image: VMIs are widely used as the basis for routine diagnosis due to their ability to reduce beam-hardening and metal artifacts and to enhance iodine conspicuity. They are obtained by linearly combining the basis material volume fraction maps [115], [254] obtained after material decomposition, as described by the material decomposition model in Section IV. For readability and clarity, Eq. (35), which outlines this model, is replicated here in a spatially discrete form:

$\mu(E) = \sum_{m=1}^{M} f_m(E)\, x_m$,

where $x_m$ denotes the volume fraction map of the $m$-th basis material, $f_m(E)$ stands for the linear attenuation coefficient of the corresponding material at energy $E$, and $M$ is the total number of basis materials. However, the synthesis of VMIs relies on material decomposition results and is therefore limited to DECT and PCCT, which may not be available in less developed areas. Similar to Section V-C1, a number of approaches have been explored to directly synthesize VMIs from single-energy CT scans. Cong et al. [255] first used a modified ResNet for VMI generation from single polychromatic CT scans, and then developed a sinogram-domain method [256] that synthesizes VMIs with a fully-connected NN predicting virtual monochromatic sinograms from single polychromatic measurements. Kawahara et al. [257] employed a GAN to generate VMIs from kilovoltage CT images, while Koike et al. [258] used a U-Net for a similar purpose in imaging of head and neck cancers. More interestingly, Fink et al. [259] found that using VMIs synthesized from single-energy CT images for pulmonary embolism classification outperforms working directly on the original single-energy images.
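Given decomposed basis maps, VMI synthesis itself is a simple linear combination, as the following sketch shows; the basis fractions and attenuation coefficients are arbitrary stand-ins rather than tabulated values.

```python
# Minimal sketch of VMI synthesis as a linear combination of basis
# volume-fraction maps (the spatially discrete Eq. (35) above).
import numpy as np

M, h, w = 2, 128, 128
x = np.random.rand(M, h, w)          # volume-fraction maps (stand-ins)
x /= x.sum(axis=0, keepdims=True)    # fractions sum to one per voxel

# Linear attenuation coefficients f_m(E) at the requested energy (cm^-1);
# in practice these come from tabulated data, here arbitrary stand-ins.
f = np.array([0.21, 0.48])

vmi = np.tensordot(f, x, axes=1)     # mu(E) = sum_m f_m(E) * x_m
print(vmi.shape)                     # (128, 128)
```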
On the other hand, VMI synthesis is a downstream task after image reconstruction and material decomposition, during which deep denoising plays a role and potentially affects VMI quality in clinical practice. Extensive studies have investigated this effect through quantitative assessment and/or subjective reader studies. Kojima et al. [245] examined VMI CT number accuracy at various radiation doses, finding that the accuracy remains unaffected except at extremely low radiation doses (6.3 mGy). Sato et al. [260] compared VMIs from DLIR with routine baselines from hybrid iterative reconstruction for contrast-enhanced abdominal DECT imaging, concluding that the vessel and lesion conspicuity of VMIs and iodine density images is improved with DLIR. Xu et al. [261] reached a similar conclusion; in particular, they found that 40 keV VMIs from DLIR offer better CNR and similar or improved image quality compared to 50 keV VMIs from hybrid iterative reconstruction, suggesting that the 40 keV VMI with DLIR could become a new standard for routine low-keV VMI reconstruction. A study of carotid DECT angiography by Jiang et al. [262] also supports the conclusion that DLIR improves the image quality and diagnostic performance of VMIs compared to hybrid iterative reconstruction. This superiority was further confirmed in DECT angiography with a reduced iodine dose (200 mgI/kg) in terms of image quality and arterial depiction by Noda et al. [243]. Additionally, the effect of direct denoising on VMIs has been investigated. In a study by Lee et al. [263], VMIs post-processed with ClariCT.AI (an FDA-approved, vendor-agnostic image denoising software) were compared with the original standard VMIs in the assessment of hypoenhancing hepatic metastases. The results suggest that denoising leads to better image quality and lesion detectability. A similar conclusion was reached by Seo et al. [264] with the same post-denoising method for the evaluation of hypervascular liver lesions.
3) Contrast Agent Dose Reduction: Iodine-enhanced CT is essential for diagnosing various diseases. However, iodine-based contrast media can cause significant side effects, including allergic reactions in certain patients, as well as dose-dependent kidney injury and thyroid dysfunction. To investigate the possibility of reducing the iodine dose while maintaining diagnostic accuracy, Haubold et al. [265] trained a GAN to selectively enhance iodine contrast. They ultimately achieved a 50% contrast dose saving, confirmed by a visual Turing test in which three radiologists assessed pathological consistency. Noda et al. [266] explored the potential of leveraging vendor DLIR for simultaneous iodine and radiation dose reduction in thoraco-abdomino-pelvic DECT imaging. They compared the 40 keV VMIs from DLIR of double low-dose (50% iodine, 50% radiation) scans with VMIs from hybrid iterative reconstruction of standard-dose scans. Diagnostic image quality was achieved in 95% of participants in the double low-dose group, suggesting the feasibility of maintaining diagnostic quality at half doses of radiation and iodine using DLIR.
4) Others: Several other intriguing deep post-processing techniques for spectral CT include virtual non-contrast image synthesis, virtual non-calcium image synthesis, and spectral CT-based thermometry. Virtual non-contrast images can replace non-contrast scans in a DECT scanning protocol, thereby saving radiation dose. However, purely physics-based two-measurement material decomposition algorithms exhibit limited accuracy and stability in the presence of three materials. Poirot et al. [267] employed a CNN to leverage anatomical information, bridging the gap between material decomposition-derived virtual non-contrast images and real non-contrast images to generate higher-fidelity images.
Virtual non-calcium images are useful for visualizing bone marrow and osteolytic lesions, and even for the diagnosis of multiple myeloma [268], [269]. Like virtual non-contrast images, they also suffer from excessive noise and artifacts resulting from material decomposition. Gong et al. [137] proposed a custom dual-task CNN that directly maps spectral CT images to material type maps and corresponding mass density maps. The experimental results demonstrate significantly reduced noise and artifacts in virtual non-calcium images and good visibility of bone marrow lesions.
CT-based thermometry provides a non-invasive means of estimating temperature inside the human body by monitoring the attenuation changes associated with temperature-dependent radiodensity. Heinrich et al. [270] explored the potential of improving temperature sensitivity with VMIs from DLIR of DECT scans compared to conventional single-energy CT images. Their results show that VMIs significantly enhance temperature sensitivity for different materials, particularly for bone with a boost of 211%. The choice between DLIR and hybrid iterative reconstruction has no effect on temperature measurement, suggesting great potential for dose reduction with deep learning techniques. More recently, Wang et al. [271] combined an advanced PCD with four energy-bin measurements for robust material decomposition and a fully-connected NN for temperature prediction. They observed a non-linear relationship between thermal sensitivity and the concentration of the CaCl2 solution in their experiment, achieving final thermometry accuracies of 3.97 °C and 1.8 °C for a 300 mmol/L CaCl2 solution and a milk-based protein shake, respectively.

VI. PERSPECTIVES
Advances in spectral CT represent a major frontier of the medical CT field, combining cutting-edge hardware for photon-counting detection with AI-empowered software for deep learning-based reconstruction. As reviewed above, photon-counting spectral CT promises to significantly improve medical CT performance in terms of spatial resolution, spectral resolution, tissue contrast, and dose efficiency. The distinctive capability of photon-counting CT for material decomposition is clinically attractive for novel multi-contrast-enhanced studies, boosting CT not only in anatomical imaging but also in functional and even cellular imaging tasks. All of these can be implemented using, or coupled with, machine learning methods. Most of these machine learning methods are deep neural networks, involved in every key step of the imaging workflow.
Looking ahead, the convergence of photon-counting and deep learning techniques will surely establish spectral CT as the new standard of medical CT. To realize the huge potential of photon-counting spectral CT, challenges remain to be addressed before task-specific methods and protocols can be successfully translated into clinical practice. These challenges include, but are not limited to, the following aspects.
Direct Reconstruction: Deep NNs have been explored in a number of studies to reconstruct images directly from sinograms. In this approach, a neural network is trained on a large set of sinogram-image pairs until it predicts realistic reconstructed images. Here, the NN learns to reconstruct the image while simultaneously reducing noise and incorporating any other corrections desirable for reconstruction. Early methods developed for tomographic reconstruction using deep networks include AUTOMAP [272] for magnetic resonance reconstruction, as well as LEARN [273] and iCT [274] for CT reconstruction. To tackle the computational complexity, more sophisticated and efficient networks were subsequently developed [275]-[278].
Direct reconstruction techniques may be extended to multichannel reconstruction, including photon-counting spectral CT reconstruction. One possible way would be to use multichannel networks incorporating data in multiple energy bins, or an ensemble of networks with weight sharing across energies. Importantly, the correlations among the data in these channels should be utilized, for example, through a term in the loss function, as sketched below.
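One hedged possibility is to augment the usual per-channel fidelity loss with a penalty that couples the energy channels, e.g., by encouraging shared edge structure; the specific penalty below is an illustrative assumption, not an established formulation.

```python
# Minimal sketch of an inter-channel coupling term added to the training
# loss of a multichannel reconstruction network; the penalty is assumed.
import torch

def interchannel_loss(pred, target, weight=0.1):
    """MSE plus a penalty on differences between channel-wise gradients."""
    mse = torch.nn.functional.mse_loss(pred, target)
    # Spatial gradients per energy channel (simple finite differences).
    gx = pred[..., :, 1:] - pred[..., :, :-1]
    gy = pred[..., 1:, :] - pred[..., :-1, :]
    # Penalize deviation of each channel's gradients from the channel mean,
    # a crude proxy for shared edge structure among energy bins.
    corr = ((gx - gx.mean(dim=1, keepdim=True)) ** 2).mean() + \
           ((gy - gy.mean(dim=1, keepdim=True)) ** 2).mean()
    return mse + weight * corr

pred = torch.rand(2, 5, 64, 64, requires_grad=True)  # batch x bins x H x W
target = torch.rand(2, 5, 64, 64)
loss = interchannel_loss(pred, target)
loss.backward()
```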
Motion Correction: The much-reduced pixel size of PCDs enables CT imaging at ultrahigh resolution, which is one major advantage of PCCT over traditional EID-based CT and is critical for resolving anatomical and pathological details, such as cochlear features, lung nodules, and coronary plaques. As resolution drastically improves, the sensitivity to patient motion and geometric misalignment increases and can become the limiting factor of image resolution. This increased sensitivity also challenges the assumption of smooth patient movement across views [279]-[281].
To address this issue, Li et al. [282] developed a rigid patient motion compensation method for high-resolution helical PCCT based on locally linear embedding. Their method uses a coarse-to-fine searching framework to boost efficiency, along with several accuracy-improving steps that mask bad pixels, unreliable volumes, and the patient bed, respectively. The method was evaluated on patient wrist scans in a clinical trial, revealing fine bony structures previously hidden by motion blur, as shown in Fig. 5. Subsequently, Li et al. [283] proposed a unified, reference-free, all-in-one motion correction method for robotic CT with arbitrary scanning trajectories using a nine-degree-of-freedom model, which is capable of simultaneously addressing rigid patient motion, system misalignment, and coordination errors. The effectiveness of the method has been verified on experimental robotic-arm-based PCCT scans of a sacrificed mouse, demonstrating a substantial resolution boost and artifact reduction.
Diffusion Models: As a score-matching-based generative approach, diffusion models (DMs) have recently drawn major attention from the community, as they effectively compete with or even outperform GANs for image generation and other tasks [284], and have been broadly adapted for medical imaging [285], including PCCT image generation [286]. They involve gradually degrading a sample of interest (i.e., an image) with subtle Gaussian noise until the sample becomes a random Gaussian field, learning the noising process in terms of a score function, and then generating a meaningful sample by inversion from a Gaussian noise realization [287]. Specifically, the inverse process uses the gradient of the log-density of the prior (the score), which is approximated with a NN trained for score matching, and generates an image according to the a priori probability distribution of the training dataset.
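For intuition, one DDPM-style reverse (ancestral sampling) step driven by a learned score can be sketched as follows; the toy score function and noise schedule are assumptions, and a data-fidelity gradient step, as discussed next, can be interleaved between such updates to solve inverse problems.

```python
# Minimal sketch of one reverse-diffusion step driven by a learned score;
# the toy "score network" and constant schedule below are assumptions.
import torch

def reverse_step(x_t, t, score_fn, beta):
    """One DDPM-style reverse update x_t -> x_{t-1} using the score."""
    score = score_fn(x_t, t)                     # approx. grad log p_t(x)
    mean = (x_t + beta * score) / (1.0 - beta) ** 0.5
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + beta ** 0.5 * noise

# Toy score function standing in for a trained network.
score_fn = lambda x, t: -x                       # score of a unit Gaussian

x = torch.randn(1, 1, 64, 64)                    # start from pure noise
for t in reversed(range(100)):
    x = reverse_step(x, t, score_fn, beta=0.01)
```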
DMs can be used to solve inverse problems by adding a data-fidelity gradient descent step to the inverse diffusion, or by sampling from the pseudo a posteriori probability distribution conditioned on the observed data; both strategies work in an unsupervised manner. These methods have been applied to various inverse problems, such as deblurring of RGB multichannel images [288]. Moreover, DMs are independent of the measurement model, so the same approaches can be used for multi-energy spectral CT reconstruction or one-step material decomposition under different imaging geometries and sampling conditions.

Hardware Refinement: Over the past years, photon-counting detectors have been greatly refined. More efforts have been devoted to CZT detectors, but deep-silicon detectors are also of great interest. While CZT detectors and the like are more compact, silicon technology is more mature, reliable, and cost-effective, with the potential to give more quantitative spectral imaging results. A detailed comparison is yet to be seen. Since photon-counting detector pitches are substantially smaller than those of energy-integrating detectors, the spatial resolution of CT images can be accordingly improved, coupled with a reduced X-ray source focal spot. However, a small focal spot usually means a low X-ray flux. Hence, a balance must be struck between image resolution, noise, and imaging speed. It should be underlined that while hardware refinement in either detectors or sources is important, this kind of research will more often be performed by leading companies than by academic groups. Since this review focuses on the computational aspects of spectral CT, in the following we discuss more AI-related challenges.
Big Data Construction: It is well known that big data is a prerequisite for data-driven research. Clearly, it is not easy to amass big PCCT data, for several reasons including limited accessibility of PCCT scans, patient privacy, industrial confidentiality, and so on. We believe that this issue must be addressed using simulation tools, ideally in a healthcare metaverse. Such an idea was discussed as the first use case in a recent perspective article [289]. Along that direction, virtual twins of physical PCCT scanner models can scan patient avatars to produce simulated data. Along a complementary direction, a limited number of real PCCT scans can be used to train a generative model for realistic image augmentation. For example, it was recently shown that a diffusion model can synthesize realistic data with suppressed privacy leakage [290]. This will facilitate federated learning at the level of datasets.
AI Model Development: When sufficiently informative PCCT data are available, more advanced AI models should be developed to address the current weaknesses of deep reconstruction networks in the CT field. The well-known problems of deep networks include stability, generalizability, uncertainty, interpretability, fairness, and more. As briefly mentioned in our review, a unique opportunity in deep learning-based PCCT imaging is raw data correction for charge sharing, pulse pileup, and other effects. These effects are complicated, nonlinear, and stochastic; deep learning-based solutions are still few, and more are expected in the future. Furthermore, large models are gaining great attention, with ChatGPT as a precursor of the next generation of AI methods, i.e., a first step toward artificial general intelligence (AGI). It is believed that large models, multi-modal large models in particular, will further improve PCCT performance.
High-performance and High-efficiency Computing: Deep learning with large models demands substantial computational resources. Parallel/cloud computing, model distillation, and hybrid (combining classic and deep learning) reconstruction methods can act synergistically toward practical PCCT methods. Special hardware such as FPGAs [291] could be adopted in PCCT tasks for imaging speed and energy efficiency.
Clinical Translation: The development of accurate and robust PCCT methods should lead to diverse clinical applications, from screening and diagnosis to treatment planning and prognosis. PCCT can also be used to guide minimally invasive procedures, such as biopsy and ablation, by providing real-time information over a region of interest [292]. The integration of PCCT (and DECT) with other imaging modalities, such as MRI and PET, would be beneficial as well, leading to a better understanding of anatomical forms and pathological functions.
Hybrid PET/CT Spectral Imaging: The integration of spectral CT with PET has the potential to open novel clinical applications. However, such an integrated system either requires a costly hardware upgrade or is associated with increased radiation exposure. Most existing spectral CT imaging methods are based on a single modality that uses X-rays. Alternatively, it is possible to explore a combination of X-rays and γ-rays for spectral imaging [293]. The concept of this PET-enabled spectral CT method exploits a standard time-of-flight PET emission scan to derive high-energy γ-ray CT attenuation images and combines them with low-energy X-ray CT images to form dual-energy or multi-energy imaging. This method has the potential to make spectral CT imaging more readily available on clinical PET/CT scanners. The enabling algorithm of this hybrid spectral imaging method is the reconstruction of γ-ray attenuation images from PET emission data using the maximum-likelihood attenuation and activity algorithm [293], [294]. While the counting statistics of PET emission data are relatively low, machine learning-based approaches have been developed to further improve image reconstruction, for example, using the kernel method alone [293], [295] or in combination with deep neural networks [296]-[298]. These reconstruction approaches operate directly on single subjects without requiring pretraining on a large number of datasets. Alternatively, many other big data-based deep learning techniques described in Section III, Section IV, and Section V may be applied to the development of hybrid PET/CT spectral imaging.

VII. CONCLUSION
In conclusion, this paper has systematically reviewed spectral CT with an emphasis on photon-counting and deep learning techniques. The field has evolved from traditional DECT, with an established status in medical imaging, to contemporary PCCT, with promising results and new utilities. Several remaining challenges have been discussed. The future of this technology looks exciting, with numerous opportunities to explore so that our imaging dreams can be turned into reality.
Notation: $[a, b]$ is the horizontal concatenation of two column vectors $a$ and $b$ of the same length. $\{x_k\} = \{x_k, k = 1, \ldots, K\}$ denotes an ordered collection of vectors, where the number of elements $K$ depends on the context. $\mathcal{L}(\cdot, \cdot)$ denotes a loss function that evaluates the agreement between two vectors, e.g., $\mathcal{L}(a, b) = \sum_i -a_i \log b_i + b_i$ (negative Poisson log-likelihood) or $\mathcal{L}(a, b) = \|a - b\|_p^p$. $R$ is a regularization functional.

Fig. 3. Material decomposition of simulated PCCT acquisitions of a patient phantom (left) with projection-based (middle) and image-based (right) U-Net CNNs. The two materials of the decomposition are soft tissue (top row) and bone (bottom row). Figure adapted from Abascal et al. [129] and distributed under a Creative Commons Attribution 4.0 License, see https://creativecommons.org/licenses/by/4.0/.

Fig. 4. Overview of the sub-topics in Section V. The data pre-processing section covers deep correction methods for spectral distortion (e.g., falsely increased counts in the low-energy bin due to the charge sharing effect, and non-linear responses due to the pulse pileup effect) and non-uniformity in PCD projection images. The image post-processing sections discuss deep post-processing methods that enhance DECT and PCCT imaging and their impacts on clinical diagnosis.

Fig. 5. High-resolution PCCT scan of a patient wrist from a clinical trial (90 µm voxel) before and after motion correction (adapted from Li et al. [282] with permission).