TensorCircuit: a Quantum Software Framework for the NISQ Era

TensorCircuit is an open source quantum circuit simulator based on tensor network contraction, designed for speed, flexibility and code efficiency. Written purely in Python, and built on top of industry-standard machine learning frameworks, TensorCircuit supports automatic differentiation, just-in-time compilation, vectorized parallelism and hardware acceleration. These features allow TensorCircuit to simulate larger and more complex quantum circuits than existing simulators, and are especially suited to variational algorithms based on parameterized quantum circuits. TensorCircuit enables orders of magnitude speedup for various quantum simulation tasks compared to other common quantum software, and can simulate up to 600 qubits with moderate circuit depth and low-dimensional connectivity. With its time and space efficiency, flexible and extensible architecture and compact, user-friendly API, TensorCircuit has been built to facilitate the design, simulation and analysis of quantum algorithms in the Noisy Intermediate-Scale Quantum (NISQ) era.


Introduction
The landscape of open-source and proprietary software for simulating quantum computers [1] has grown in recent years. While features and functionality vary between packages, across the board users now have many high-quality options available for constructing and simulating quantum circuits. However, to deepen our understanding of quantum algorithm performance, researchers increasingly need to simulate larger and more complex quantum circuits, and to optimize quantum circuits that may contain a large number of tunable parameters. In spite of the promising state of existing quantum software, there remain a number of challenges in running large-scale, complex simulations. Here we introduce TensorCircuit , a new open source tensor network based quantum circuit simulator built to address these challenges. Written in Python, and designed for speed, flexibility and ease-of-use, TensorCircuit is built on top of a number of industry leading libraries. Via the TensorFlow [2], JAX [3] and PyTorch [4] machine learning libraries, convenient access is provided for automatic differentiation, just-in-time compilation, vectorized parallelism, and hardware acceleration. Fast tensor network contraction is enabled by the state-of-the-art cotengra [5,6] package, which also gives users customizable control over the tensor network contraction process. These features enable efficient optimization of parameterized quantum circuits, allowing for more complex cases to be modelled. In addition, the TensorCircuit syntax aims to allow complicated tasks to be implemented with a minimal amount of code, saving time spent coding as well as in simulation.

Challenges in simulating quantum circuits
As fully fault-tolerant quantum computers capable of running large scale quantum algorithms may still be many years away, considerable research effort has been spent investigating the prospects for quantum advantage in the nearer term. Some algorithms for Noisy Intermediate-Scale Quantum (NISQ) [7] devices aim to leverage classical computational power to supplement quantum computers, which may have only a limited number of error-prone qubits under control. In particular, hybrid quantum-classical algorithms [8,9] such as the Variational Quantum Eigensolver (VQE) [10] and the Quantum Approximate Optimization Algorithm (QAOA) [11] are based around the concept of parameterized quantum circuits (PQC). These circuits, which contain quantum gates with tunable parameters (for instance, single-qubit gates with variable rotation angles), are embedded in a classical optimization loop. By optimizing the value of the parameters in the circuit, one aims to drive the output state of the quantum circuit towards the solution to a given problem.
In a prototypical VQE example, one wishes to find the ground state energy E_0 of a quantum system with Hamiltonian H. The output of a PQC is a quantum state |ψ(θ)⟩, where θ is a vector of tunable parameters. This trial state - known as an ansatz - forms a guess for the ground state wavefunction. By performing appropriate measurements on |ψ(θ)⟩, one can estimate the expected energy ⟨H⟩_θ = ⟨ψ(θ)| H |ψ(θ)⟩. By minimizing ⟨H⟩_θ with respect to the parameters θ, one obtains an upper bound estimate of the ground state energy:

E_0 ≤ min_θ ⟨H⟩_θ.    (1)

There are a number of issues that affect the efficacy and efficiency of this approach. Firstly, for an accurate estimate of E_0, the ansatz |ψ(θ)⟩ should, for some values of θ, be a good approximation to the true ground state. Whether this is so depends on the nature of the problem and the complexity of the PQC from which the ansatz is constructed. On the one hand, simple ansätze - for instance, the "hardware efficient" ansatz of [12] - may be easier to implement on real or simulated quantum computers, but may either not be sufficiently accurate (for instance, due to lack of physically-relevant structure or short depth) or else suffer from other issues, such as so-called barren plateaus in parameter space [13] or local minima in the energy landscape [14]. On the other hand, more complex ansätze, such as the unitary coupled cluster (UCC) [15] approach proposed for quantum chemistry problems, may require quantum circuit complexities and depths beyond what can currently be implemented or simulated, and the associated circuits must then be truncated or simplified. The ability to simulate larger and deeper quantum circuits would enable a systematic investigation of larger classes of ansätze.
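The variational upper bound of Eq. (1) can be illustrated with a plain NumPy sketch (not TensorCircuit code), using an assumed toy setup: a single-qubit ansatz |ψ(θ)⟩ = RY(θ)|0⟩ and Hamiltonian H = Z, whose ground state energy is E_0 = -1.

```python
import numpy as np

# Toy variational principle: ansatz |psi(theta)> = RY(theta)|0>, Hamiltonian H = Z.
# The true ground state energy of Z is E0 = -1.
Z = np.diag([1.0, -1.0])

def ansatz(theta):
    # RY(theta)|0> = [cos(theta/2), sin(theta/2)]
    return np.array([np.cos(theta / 2), np.sin(theta / 2)])

def energy(theta):
    # <psi(theta)| H |psi(theta)>
    psi = ansatz(theta)
    return float(psi @ Z @ psi)

# Every theta gives an upper bound on E0; the best point here reaches E0 exactly
thetas = np.linspace(0.0, 2 * np.pi, 201)
e_best = min(energy(t) for t in thetas)
```

In a real VQE the grid scan above is replaced by gradient-based optimization over many parameters, which is exactly where automatic differentiation becomes valuable.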
Secondly, for a given ansatz, the evaluation of ⟨H⟩_θ can be an involved process. For instance, consider the case where H is an n-qubit Hamiltonian, which can be expressed as a weighted sum of tensor products of Pauli operators, i.e.,

H = Σ_j α_j P_j,    (2)

where the α_j are real coefficients, each Pauli string P_j is of the form P_j = σ_{i_1} ⊗ σ_{i_2} ⊗ . . . ⊗ σ_{i_n}, and the σ_{i_j} are single-qubit Pauli operators or the identity. In the most straightforward approach, ⟨H⟩_θ is estimated by first estimating the expectation ⟨P_j⟩_θ of each term and then adding together these individual contributions. If the Hamiltonian consists of many Pauli terms, ways of speeding up the evaluation of Eq. (2) - for instance, by exploiting efficient representations of H (e.g. as a sparse matrix or Matrix Product Operator (MPO) [16]), or being able to compute multiple terms in parallel - can have a large impact on the computation time.
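The term-by-term evaluation of Eq. (2) can be sketched in plain NumPy with dense matrices, using a hypothetical two-qubit Hamiltonian H = 0.5 Z_0 Z_1 - 1.0 X_0 and a Bell state standing in for the PQC output:

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])

def pauli_string(ops):
    # Dense matrix for one term P_j = sigma_{i1} ⊗ sigma_{i2} ⊗ ...
    return reduce(np.kron, ops)

# H = 0.5 * Z0 Z1 - 1.0 * X0, stored as (alpha_j, [sigma list]) pairs
terms = [(0.5, [Z, Z]), (-1.0, [X, I2])]

psi = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)  # stand-in output state

# <H> = sum_j alpha_j <P_j>: evaluate each Pauli term, then sum
expval = sum(a * (psi.conj() @ pauli_string(ops) @ psi) for a, ops in terms)
```

Dense matrices grow as 2^n x 2^n, which is why sparse, MPO or vectorized term evaluation matters for larger systems.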
Thirdly, the optimization problem in Eq. (1) is, in general, non-convex. Thus, finding a global minimum is in general computationally intractable. However, if gradients of potentially complicated expressions can be efficiently evaluated, one can use gradient descent methods to find a local minimum which may yield a decent approximate solution. By initializing the optimization from a number of different positions (i.e. different values of θ), multiple local minima can be obtained, improving the chances that one of these gives a good solution. If these multiple solutions can be optimized in parallel, one may achieve large efficiency gains.

Machine learning libraries
The challenges discussed in the previous section overlap to a large degree with problems faced in machine learning, and especially deep learning [17]. Fortunately, the need to tackle machine learning problems of increasing complexity and to deal with datasets of ever-larger size has led to the development of impressive software, and many frameworks are now available that combine powerful features with easy-to-use syntax. In particular:
Fast gradients. At the heart of all advanced machine learning packages is the ability to perform automatic differentiation (AD) [18,19], of which the backpropagation algorithm used to train neural networks is a special case. AD enables efficient computation of the gradients of functions defined in code, and is vital to the optimization of many machine learning models.
JIT. Just-in-time compilation is a way of compiling certain parts of code during program execution. For interpreted languages such as Python, "jitting" a function can lead to large performance gains, with the time required to execute a jitted function often only a small fraction of the time needed if the function were interpreted. While the first time the function is called there may be an overhead cost for compilation (staging time), this cost can be negligible compared to the time saved if the function is subsequently called many times.
Vectorization (VMAP). This feature allows a function to be evaluated on multiple points in parallel, with significant speedup compared with using a naïve for loop. In machine learning this allows one, for instance, to perform computations on batches of data at the same time.
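The idea behind VMAP can be illustrated with plain NumPy broadcasting (a stand-in for the VMAP transforms of TensorFlow/JAX), where a whole batch of parameter values is processed in one vectorized call rather than a Python loop:

```python
import numpy as np

def cost(theta):
    # A toy cost function; works on scalars and on whole batches alike
    return np.sin(theta) ** 2 + 0.5 * np.cos(3 * theta)

thetas = np.linspace(0.0, np.pi, 512)          # a batch of parameter values

naive = np.array([cost(t) for t in thetas])    # naive for loop, one point at a time
batched = cost(thetas)                         # one vectorized call over the batch
```

Frameworks with VMAP generalize this pattern to arbitrary functions, including full circuit simulations, without requiring the function body to be written in batched form.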
Hardware acceleration. For complex machine learning models, the ability to execute code on multiple CPUs, GPUs and TPUs may be necessary for training to complete in a reasonable amount of time.
These powerful features are also beneficial in simulating quantum computers, and are particularly suited to variational quantum algorithms. Here, fast gradient evaluation via automatic differentiation gives simulators an inherent advantage over real quantum computers, which must estimate gradients in a less direct way; for instance by sampling the outputs for various input parameter choices and computing finite differences [20,21]. Vectorization can (among other things) be used to evaluate PQC circuits with multiple parameter choices concurrently, or compute expectations of multiple Pauli strings simultaneously, and JIT and hardware acceleration provide for further time savings.
In addition, there is increasing interest in solving machine learning problems via quantum computing, as well as combining classical and quantum machine learning (QML) algorithms. Both of these can be facilitated by better integration between classical ML frameworks and quantum circuit simulators, and great value can be derived from a seamless integration of the two.

The next phase of quantum software
While quantum software has progressed a great deal in the last few years - with packages such as Qiskit [22], Cirq [23], ProjectQ [24], HiQ [25], Q# [26], Qibo [27], and qulacs [28] all offering powerful functionality and features - there remain significant advantages to be gained from efficient quantum simulation software supplemented with the power and features of state-of-the-art machine learning. Recently, a new generation of quantum software has started to emerge, with TensorFlow Quantum [29], Pennylane [30], Paddle Quantum [31] and MindQuantum [32] in the Python ecosystem making inroads here, and providing these features to varying degrees. However, to date, none of these fully combine all of the key features of the previous section with fast quantum circuit simulation. That is, there is no good solution for space and time efficient noiseless and noisy quantum simulation that combines a tensor network engine with the machine learning paradigms of AD, JIT and VMAP, as well as GPU support. (See Table 1 for a comparison of quantum software in terms of machine learning paradigm compatibility.) TensorCircuit has been designed to fill this gap, and give users a faster, more flexible and more convenient way to simulate quantum circuits and quantum processes.

                              AD    JIT   VMAP  GPU   TN
Qiskit/Cirq/ProjectQ/Qulacs   ·     ·     ·     (✓)   ·
Qibo                          (✓)   ✓     ·     ✓     ·
TensorFlow Quantum            (✓)   ·     ✓     ·     ·
Pennylane                     ✓     (✓)   (✓)   ✓     ·
TensorCircuit                 ✓     ✓     ✓     ✓     ✓

Table 1: Quantum software support for the main machine learning paradigms and for a tensor network engine. Marks qualitatively indicate the level of support for the corresponding features: ✓ = good support, (✓) = limited support, · = not supported. TN indicates tensor network simulation engine support. Many well-known quantum software frameworks such as Qiskit or Cirq have no support for AD, JIT and VMAP; Qibo does not support AD with its most optimized and efficient backend; TensorFlow Quantum supports AD of expectation values but not directly of the wavefunction, and does not support quantum circuit simulation on GPU; Pennylane does not support VMAP over trainable parameters, and its JIT support is fragile as some methods are not JIT-compatible.

Among the ML backends, TensorFlow and JAX both offer comprehensive support for AD, JIT, VMAP and hardware acceleration, and are recommended for most tasks. PyTorch currently has limited support for JIT and VMAP, as these are implemented in experimental modules and are not yet stable.

In TensorCircuit we supplement the AD infrastructure of TensorFlow with a VMAP-compatible implementation of Jacobian and Hessian calculations, and enrich the functionality of TensorFlow's VMAP, which in TensorCircuit now supports VMAP over multiple arguments. These improvements can be accessed via the unified backend abstraction provided by TensorCircuit . If no backend is chosen, TensorCircuit defaults to using NumPy as a backend, which does not support these advanced ML features. For more on the choice of ML backend, please see the documentation: faq.html#which-ml-framework-backend-should-i-use .

TensorCircuit Overview
Our goal with TensorCircuit is to provide the first efficient, tensor network based quantum simulator that is fully compatible with the key features of modern machine learning frameworks, especially the programming paradigms of automatic differentiation, vectorized parallelism and just-in-time compilation. These features are provided via a number of popular machine learning backends which, at the time of writing, are TensorFlow, JAX and PyTorch (see Table 2).
Integration with these backends allows for general hybrid quantum-classical models, where the outputs of a parameterized quantum circuit may be fed into a classical neural network or vice versa, and simulated seamlessly (see Figure 1). This integration is also key for research related to quantum machine learning [33].
Currently, there are very limited options for tensor network based quantum simulators, with most popular software making use of state vector simulators. State vector simulators are strongly limited by memory as wavefunction amplitudes are stored in full, and can thus struggle to simulate circuits with larger numbers of qubits. On the other hand, quantum circuits with large numbers of (possibly noisy) qubits but relatively short circuit depths -such as those corresponding to NISQ devices, whose qubits have short coherence times -fall into the applicable region of tensor network simulators.

Design philosophy
With seamless integration with modern machine learning paradigms supported by an efficient tensor network based simulation engine, the design of TensorCircuit has, from the start, been based on the following principles.
Speed. TensorCircuit uses tensor network contraction to simulate quantum circuits, and is compatible with state-of-the-art third party contraction engines. In contrast, the majority of current popular quantum simulators are state-vector based. The tensor contraction framework allows TensorCircuit , in many cases, to simulate quantum circuits with improved efficiency in terms of time and space compared with other simulators. JIT, AD, VMAP and GPU support can all also provide significant acceleration (often, of several orders of magnitude) in many scenarios.
Flexibility. TensorCircuit is designed to be backend agnostic, making it easy to switch between any of the machine learning backends with no change in syntax or functionality. Via different ML backends, there is flexibility to simulate hybrid quantum-classical neural networks, run code on CPUs, GPUs and TPUs, and switch between 32 bit and 64 bit precision data types.
Code efficiency. Modern machine learning frameworks such as TensorFlow, PyTorch and JAX have user-friendly syntax, allowing for powerful tasks to be carried out with a minimal amount of code. With TensorCircuit , we are similarly focused on a compact and easy-to-use API to boost readability and productivity. Compared with other popular quantum simulation software, TensorCircuit can often perform similar tasks with significantly less code (see tc vs. tfq for VQE for an example). The backend agnostic syntax of TensorCircuit additionally makes it easy to switch between ML frameworks with a single line of setup code.

Community focused.
TensorCircuit is open source software, and we care about readability, maintainability and extensibility of the codebase. We invite all members of the quantum computing community to take part in its continued development and use.

Tensor network engine
TensorCircuit simulates quantum circuits using the tensor network formalism, which has a long history in computational physics and was more recently pioneered for quantum circuit simulation [34] . Indeed, the graphical representations for quantum circuits and tensor networks are consistent, rendering a direct and simple translation from quantum circuit simulation to tensor network contraction. In this picture, quantum circuits are represented by a network of low-rank tensors corresponding to individual quantum gates, and the computation of amplitudes, expectation values or other scalar quantities is performed by contracting the edges of the network until only a single node remains. The order, or path, in which the edges are contracted is important, and can have a large impact on the time and space required to contract the network [6,35,36]. See [37] for a good introduction to tensor networks from a physics perspective.
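A minimal NumPy sketch of this picture (not TensorCircuit's actual engine): gates become low-rank tensors, and the output state is obtained by contracting the edges of the network with einsum, here for a two-gate Bell-state circuit.

```python
import numpy as np

# Gate tensors: H is a rank-2 tensor; CNOT is reshaped into a rank-4 tensor
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float).reshape(2, 2, 2, 2)

# |00> as a rank-2 tensor with one index per qubit
psi = np.zeros((2, 2))
psi[0, 0] = 1.0

# Contract the network edge by edge: H acts on qubit 0, then CNOT on (0, 1)
psi = np.einsum("ac,cb->ab", H, psi)
psi = np.einsum("abcd,cd->ab", CNOT, psi)

state = psi.reshape(4)  # the Bell state (|00> + |11>)/sqrt(2)
```

For two gates the contraction order is irrelevant, but for large networks the order chosen (here implicitly left-to-right) dominates the time and memory cost, which is what dedicated path optimizers like cotengra address.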
In the most general case, as for other types of quantum simulators, the time required to simulate quantum circuits via tensor network contraction is exponential in the number of qubits. However, in special cases -including many of practical relevance -tensor network contraction can offer significant advantages since it avoids the memory bottleneck that plagues full state simulators and, to date, the largest scale quantum computing simulations such as the simulation of random circuits used in quantum supremacy experiments [38,39] have all been performed via this approach [40,41,42,43,44,45].
The tensor network data structure used in TensorCircuit is powered by the TensorNetwork [46] package. In addition, TensorCircuit can utilize state-of-the-art external Python packages such as cotengra for selecting efficient contraction paths, and the contraction is then performed via einsum and matmul by the machine learning backend selected by the user (see Figure 2).
Architecture. The tensor network engine underlying the simulation of quantum circuits is built on top of various machine learning frameworks, with an abstraction layer in between that unifies the different backends. At the application layer, TensorCircuit also includes various advanced quantum algorithms based on our latest research [47,48,49,50,51,52,53]. The overall software architecture is shown in Figure 3.
Figure 1: A general hybrid quantum-classical neural network, where a cost function C is summed over a batch of input states {ρ_j} and depends on the parameters of a quantum circuit and a classical neural network. By using a classical optimizer, the parameters of both networks can be iteratively improved. Integration with classical machine learning backends allows the entire end-to-end process to be seamlessly simulated in TensorCircuit , with vmap, jit and automatic differentiation enabling efficiency gains throughout the optimization process.
Figure 2: In TensorCircuit , the gates that comprise the quantum circuit are contained in a tc.Circuit object. Computation of circuit outputs is executed in two steps: first, a tensor contraction path is determined by a contraction engine, e.g., cotengra; the selected ML backend then takes care of the actual contraction.

Installing and contributing to TensorCircuit
TensorCircuit is open-sourced under the Apache 2.0 license. The software is available on the Python Package Index (PyPI) and can be installed using the command pip install tensorcircuit.
The development of TensorCircuit is open-sourced and centered on GitHub: TensorCircuit Repository . We welcome all members of the quantum community to contribute, whether it is
• Answering questions on the discussion page or issues page.
• Raising issues such as bug reports or feature requests on the issues page.
• Improving the documentation (docstrings/tutorials) by pull requests.
• Contributing to the codebase by pull requests.
For more details, please refer to the contribution guide: contribution.html .

Circuits and gates
Jupyter notebook: 3-circuits-gates.ipynb
In TensorCircuit , a quantum circuit on n qubits - which supports both noiseless and noisy simulation via Monte Carlo trajectory methods - is created by the tc.Circuit(n) API. Here we show how to create basic circuits, apply gates to them, and compute various outputs.

Preliminaries
In the remainder of this document, we assume that we have both TensorCircuit and NumPy imported as

import tensorcircuit as tc
import numpy as np

Furthermore, we assume that a TensorCircuit backend has been set, e.g.

K = tc.set_backend("tensorflow")

and the symbol K that appears in code snippets (e.g. K.real()) refers to that backend. Other options for the set_backend method are "jax", "pytorch" and "numpy" (the default backend).
In TensorCircuit , qubits are numbered from 0, with multiqubit registers numbered with the zeroth qubit on the left, e.g. |0⟩_q0 |1⟩_q1 |0⟩_q2. Unless needed, we will omit subscripts and use compact notation, e.g. |010⟩, to denote multiqubit states. X, Y, Z denote the standard single qubit Pauli operators, with subscripts, e.g. X_3, to clarify which qubit is acted on. Expectation values of operators with respect to a state |ψ⟩, such as ⟨ψ| Z ⊗ I ⊗ X |ψ⟩, will be denoted in shorthand as ⟨Z_0 X_2⟩. Unless stated otherwise, expectation values are always with respect to the output state of a given quantum circuit. If that circuit is parameterized by a set of angles θ, then the parameter-dependent expectation value may be denoted by ⟨·⟩_θ. In TensorCircuit the default runtime datatype is complex64, but if higher precision is required this can be set as follows:

tc.set_dtype("complex128")

Basic circuits and outputs
Consider the following two-qubit quantum circuit consisting of a Hadamard gate on qubit q0, a CNOT on qubits q0 and q1, and a single qubit rotation R_X(0.2) of qubit q1 by angle 0.2 about the X axis (see Figure 4). Qubits are numbered from 0 with q0 on the top row. This circuit can be implemented in TensorCircuit as

c = tc.Circuit(2)
c.h(0)
c.cnot(0, 1)
c.rx(1, theta=0.2)

From this, various outputs can be computed.
Basic outputs. The full wavefunction can be obtained via

c.state()

which will output an array [α_00, α_01, α_10, α_11] corresponding to the amplitudes of the |00⟩, |01⟩, |10⟩, |11⟩ basis states. The full wavefunction can also be used to generate the reduced density matrix of a subset of the qubits. The unitary matrix corresponding to the entire quantum circuit can also be output. Custom gates can be applied by supplying a matrix to c.unitary, where the optional name argument specifies how the gate is displayed when the circuit is output to LaTeX.
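For reference, the state of the example circuit (H on q0, CNOT on q0 and q1, R_X(0.2) on q1) can be reproduced with plain NumPy matrices - a sketch of what the tensor contraction computes, not TensorCircuit code:

```python
import numpy as np

I2 = np.eye(2)
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

def rx(theta):
    # Single-qubit rotation about X: exp(-i theta X / 2)
    return np.array([[np.cos(theta / 2), -1j * np.sin(theta / 2)],
                     [-1j * np.sin(theta / 2), np.cos(theta / 2)]])

psi = np.zeros(4, dtype=complex)
psi[0] = 1.0                       # |00>
psi = np.kron(H, I2) @ psi         # Hadamard on q0
psi = CNOT @ psi                   # CNOT with control q0, target q1
psi = np.kron(I2, rx(0.2)) @ psi   # RX(0.2) on q1

# psi now holds [a00, a01, a10, a11], matching the amplitudes from c.state()
```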
Exponential gates. Gates of the form e^{iθG}, where the matrix G satisfies G² = I, admit a fast implementation via the exp1 command, e.g., the two qubit gate e^{iθ Z⊗Z} acting on qubits 0 and 1:

c.exp1(0, 1, theta=0.2, unitary=tc.gates._zz_matrix)

where tc.gates._zz_matrix creates a numpy array corresponding to the matrix Z ⊗ Z. General exponential gates, where G² ≠ I, can be implemented via the exp command.
Non-unitary gates. TensorCircuit also supports the application of non-unitary gates by supplying a non-unitary matrix as the argument to c.unitary.
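The speedup exploits the identity e^{iθG} = cos(θ) I + i sin(θ) G, valid whenever G² = I, which avoids a full matrix exponential. This is easy to verify in NumPy for G = Z ⊗ Z:

```python
import numpy as np

theta = 0.2
Z = np.diag([1.0, -1.0])
ZZ = np.kron(Z, Z)                     # G = Z ⊗ Z satisfies G @ G = I

# Fast form used when G^2 = I: a single cos/sin combination
fast = np.cos(theta) * np.eye(4) + 1j * np.sin(theta) * ZZ

# ZZ is diagonal, so the exact matrix exponential acts elementwise on the diagonal
exact = np.diag(np.exp(1j * theta * np.diag(ZZ)))
```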

Specifying the input state and composing circuits
By default, quantum circuits are applied to the initial all-zero product state. Arbitrary initial states can be set by passing an array containing the input state amplitudes to the optional inputs argument of tc.Circuit. For example, the maximally entangled state (|00⟩ + |11⟩)/√2 can be input as follows:

c1 = tc.Circuit(2, inputs=np.array([1, 0, 0, 1]) / np.sqrt(2))

Input states in Matrix Product State (MPS) form can also be input via the optional mps_inputs argument of tc.Circuit. See Section 6.4 for details. Circuits that act on the same number of qubits can be composed together via the c.append() or c.prepend() commands. With c1 defined as above, we can create a new circuit c2 and then compose them together, leading to a circuit C3 which is equivalent to first applying C1 and then C2.

Circuit transformation and visualization
tc.Circuit objects can be converted to and from Qiskit QuantumCircuit objects. Export to Qiskit is done by

c.to_qiskit()

and the resulting QuantumCircuit object can then be compiled and run on compatible physical quantum processors and simulators. Conversely, a QuantumCircuit object can be imported from Qiskit via tc.Circuit.from_qiskit. There are two ways to visualize quantum circuits generated in TensorCircuit . The first is c.tex(), which outputs the code for drawing the associated quantum circuit using the LaTeX quantikz package [54]. The second method uses the draw function, c.draw().

As an example, we take n = 3, k = 2, set TensorFlow as our backend, and define an energy cost function to be minimized. While this example was done with the TensorFlow backend, switching to JAX can be done easily. All that is required is to redefine the optimizer opt using the JAX optimization library optax [55], choose JAX as the backend via tc.set_backend("jax"), and perform the gradient descent exactly as above. Note that if no backend is set explicitly, TensorCircuit defaults to using NumPy as the backend, which does not allow for automatic differentiation.

Optimization via SciPy
An alternative to using the machine learning backends for the optimization is to use SciPy. This can be done via the scipy_interface API call, and allows for gradient based (e.g. BFGS) and non-gradient based (e.g. COBYLA) optimizers to be used, which are not available via the ML backends.

Density matrices and mixed state evolution
Jupyter notebook: 5-density-matrix.ipynb
TensorCircuit provides two methods for simulating noisy, mixed state quantum evolution. Full density matrix simulation of n qubits is provided by using tc.DMCircuit(n), and then adding quantum operations - both unitary gates as well as general quantum operations specified by Kraus operators - to the circuit. Relative to pure state simulation via tc.Circuit, full density matrix simulation of n qubits requires as much memory as pure state simulation of 2n qubits, and thus the maximum system size simulatable is half of what can be simulated in the pure state case. A less memory intensive option is to use the standard tc.Circuit(n) object and stochastically simulate open system evolution via Monte Carlo methods.
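The memory trade-off follows directly from the sizes involved: a pure state on n qubits stores 2^n amplitudes, while a density matrix stores 2^n × 2^n entries, i.e. as many as a pure state on 2n qubits.

```python
# Memory footprint comparison (number of complex entries) for n-qubit simulation
n = 12
pure_entries = 2 ** n            # statevector: 2^n amplitudes
dm_entries = (2 ** n) ** 2       # density matrix: 2^n x 2^n entries

# A density matrix on n qubits is as large as a pure state on 2n qubits,
# so the largest simulatable system size is roughly halved.
assert dm_entries == 2 ** (2 * n)
```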

Density matrix simulation with tc.DMCircuit
We illustrate this method below, by considering a simple circuit on a single qubit, which takes as input the mixed state corresponding to a probabilistic mixture of the |0⟩ state and the maximally mixed state:

ρ(α) = α |0⟩⟨0| + (1 − α) I/2.

This state is then passed through a circuit which applies an X gate, followed by a quantum operation corresponding to an amplitude damping channel E_γ with parameter γ. This channel has Kraus operators

K_0 = [[1, 0], [0, √(1 − γ)]],   K_1 = [[0, √γ], [0, 0]].

This circuit thus induces the evolution ρ(α) → Σ_i K_i X ρ(α) X K_i†. To simulate this in TensorCircuit , we first create a tc.DMCircuit (density matrix circuit) object and set the input state using the dminputs optional argument (note that if a pure state input is provided to tc.DMCircuit, this should be done via the inputs optional argument rather than dminputs). ρ(α) has a diagonal matrix form, and thus (taking α = 0.6, so ρ(0.6) = diag(0.8, 0.2)) we initialize the density matrix circuit accordingly. Adding the X gate (and other unitary gates) is done in the same way as for pure state circuits. To implement a general quantum operation such as the amplitude damping channel, we use general_kraus, supplied with the corresponding list of Kraus operators. In this example we input the Kraus operators for the amplitude damping channel manually, in order to illustrate the general approach to implementing quantum channels with the tc.DMCircuit method. In fact, TensorCircuit includes built-in methods for returning the Kraus operators for a number of common channels, including the amplitude damping, depolarizing, phase damping and reset channels. For example, the Kraus operators for the phase damping channel with parameter γ can be returned by calling

gamma = 0.3
K0, K1 = tc.channels.phasedampingchannel(gamma)

and the phase damping channel then added to a circuit via general_kraus. The whole operation can be further simplified into a single API call.
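The channel arithmetic can be checked with a small NumPy sketch (not tc.DMCircuit code) applying the standard amplitude damping Kraus operators to ρ = |1⟩⟨1|: a fraction γ of the excited population decays to |0⟩, and the trace is preserved.

```python
import numpy as np

gamma = 0.3
# Standard amplitude damping Kraus operators
K0 = np.array([[1.0, 0.0], [0.0, np.sqrt(1 - gamma)]])
K1 = np.array([[0.0, np.sqrt(gamma)], [0.0, 0.0]])

rho = np.array([[0.0, 0.0], [0.0, 1.0]])   # |1><1|

# E(rho) = sum_i K_i rho K_i^dagger
rho_out = sum(K @ rho @ K.conj().T for K in (K0, K1))
# rho_out = diag(gamma, 1 - gamma)
```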

Monte Carlo simulation with tc.Circuit
Monte Carlo methods can be used to sample noisy quantum evolution using tc.Circuit instead of tc.DMCircuit, where the mixed state is effectively simulated by an ensemble of pure states. As for density matrix simulation, quantum channels E can be added to a circuit object by providing a list of their associated Kraus operators {K_i}. The API is the same as for the full density matrix simulation:

input_state = np.array([1, 1]) / np.sqrt(2)
c = tc.Circuit(1, inputs=input_state)
c.general_kraus(tc.channels.phasedampingchannel(0.5), 0)
c.state()

In this framework, though, the output of a channel acting on |ψ⟩ is obtained stochastically: each run samples a single Kraus branch K_i |ψ⟩ (suitably renormalized), with probability p_i = ⟨ψ| K_i† K_i |ψ⟩.
Thus, the code above stochastically produces the output of a single qubit initialized in the state |ψ⟩ = (|0⟩ + |1⟩)/√2 being passed through a phase damping channel with parameter γ = 0.5.
The Monte Carlo simulation of channels whose Kraus operators are all unitary matrices (up to a constant factor) can be handled more efficiently by using unitary_kraus instead of general_kraus. For instance, the depolarizing channel with Kraus operators parameterized by p_x, p_y, p_z,

K_0 = √(1 − p_x − p_y − p_z) I,   K_1 = √p_x X,   K_2 = √p_y Y,   K_3 = √p_z Z,

can be implemented by

px, py, pz = 0.1, 0.2, 0.3
c.unitary_kraus(tc.channels.depolarizingchannel(px, py, pz), 0)

where, in the second line, tc.channels.depolarizingchannel(px, py, pz) returns the required Kraus operators.
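Underlying both general_kraus and unitary_kraus is the same trajectory rule: branch i is chosen with probability p_i = ⟨ψ|K_i†K_i|ψ⟩ and the resulting state is renormalized. A NumPy sketch for phase damping with γ = 0.5, where averaging projectors over many trajectories recovers the channel's density matrix output:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.5
# Phase damping Kraus operators
Ks = [np.array([[1.0, 0.0], [0.0, np.sqrt(1 - gamma)]]),
      np.array([[0.0, 0.0], [0.0, np.sqrt(gamma)]])]

psi = np.array([1.0, 1.0]) / np.sqrt(2)    # (|0> + |1>)/sqrt(2)

def trajectory(psi):
    # Choose branch i with probability p_i = <psi| K_i^dag K_i |psi>
    probs = np.array([np.linalg.norm(K @ psi) ** 2 for K in Ks])
    i = rng.choice(len(Ks), p=probs / probs.sum())
    out = Ks[i] @ psi
    return out / np.linalg.norm(out)   # collapse and renormalize

# Averaging |phi><phi| over trajectories approximates E(|psi><psi|)
n = 5000
rho = sum(np.outer(v, v.conj()) for v in (trajectory(psi) for _ in range(n))) / n
```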

Externalizing the randomness
The general_kraus and unitary_kraus examples above both handle randomness generation inside the respective methods. That is, when the list of Kraus operators is supplied to general_kraus or unitary_kraus, the method partitions the interval [0, 1] into m contiguous intervals [0, 1] = I_0 ∪ I_1 ∪ . . . ∪ I_{m−1}, where the length of I_i is proportional to the relative probability of obtaining outcome i. A uniformly random variable x in [0, 1] is then drawn within the method, and outcome i is selected based on the interval in which x lies. In TensorCircuit , we have fully backend agnostic infrastructure for random number generation and management. However, the interplay between jit, random numbers and backend switching is often subtle if we rely on the random number generation inside these methods. See advance.html#randoms-jit-backend-agnostic-and-their-interplay for details.
In some situations, it may be preferable to generate the random variable outside the method, and then pass the value generated into general_kraus or unitary_kraus. This can be done via the optional status argument. This is useful, for instance, when one wishes to use vmap to batch compute multiple runs of a Monte Carlo simulation: with a function f taking a status value and returning the result of one run, vmapping f with the argument vectorized_argnums=0 - indicating that the zeroth argument (in this case the only argument) of f is the one to batch over - computes, e.g., 10 runs of the simulation in parallel from a batch of 10 status values.
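The interval-partition rule and the externalized random value can be sketched together in NumPy: cumulative probabilities define the intervals I_i, and a uniform x supplied from outside (playing the role of the status argument) selects the outcome.

```python
import numpy as np

probs = np.array([0.2, 0.5, 0.3])   # relative probabilities of m = 3 outcomes
edges = np.cumsum(probs)            # interval right endpoints: [0.2, 0.7, 1.0]

def select_outcome(x):
    # Outcome i such that x lies in I_i = [edges[i-1], edges[i])
    return int(np.searchsorted(edges, x, side="right"))

# Externalized randomness: draw the uniform values outside, then pass them in,
# so a batch of runs can be mapped over in parallel (as vmap does)
rng = np.random.default_rng(1)
statuses = rng.random(10)                         # one x per Monte Carlo run
outcomes = [select_outcome(x) for x in statuses]
```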

Advanced features
This section consists of advanced features that the general reader may skip on first reading. The reader interested in the benchmark studies of Section 7 may wish to refer to this section though, as those examples use a number of concepts from here.

Conditional measurements and post-selection
Jupyter notebook: 6-1-conditional-measurements-post-selection.ipynb
TensorCircuit allows for two kinds of operations to be performed that are related to measurement outcomes. These are (i) conditional measurements, whose outcomes can be used to control downstream conditional quantum gates, and (ii) post-selection, which allows the user to select the post-measurement state corresponding to a particular measurement outcome.

Conditional measurements
The cond_measure command is used to simulate the process of performing a Z measurement on a qubit, generating a measurement outcome with probability given by the Born rule, and collapsing the wavefunction in accordance with the measured outcome. The classical measurement outcome obtained can then act as a control for a subsequent quantum operation via the conditional_gate API and can be used, for instance, to implement the canonical teleportation circuit.
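The Born-rule sampling and collapse that cond_measure performs can be sketched in NumPy for a single qubit (a conceptual illustration only; TensorCircuit applies this to the full circuit state):

```python
import numpy as np

def z_measure_collapse(state, x):
    """Z-measure a single-qubit state [a, b]; x is a uniform
    random number in [0, 1] driving the Born rule."""
    p0 = abs(state[0]) ** 2  # Born-rule probability of outcome 0
    if x < p0:
        outcome, post = 0, np.array([state[0], 0.0])
    else:
        outcome, post = 1, np.array([0.0, state[1]])
    post = post / np.linalg.norm(post)  # collapse and renormalize
    return outcome, post

# |+> state: outcomes 0 and 1 each occur with probability 1/2
plus = np.array([1.0, 1.0]) / np.sqrt(2)
m, collapsed = z_measure_collapse(plus, 0.3)
print(m, collapsed)  # 0, state collapsed to |0>
```

The classical outcome m is exactly the kind of value that can then steer a downstream conditional gate.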

Post-selection
Post-selection is enabled in TensorCircuit via the post_select method. This allows the user to select the post-Z-measurement state of a qubit via the keep argument. Unlike cond_measure, the state returned by post_select is not normalized. As an example, consider a two-qubit circuit that prepares the Bell state (|00⟩ + |11⟩)/√2. The first qubit (q_0) is then measured in the Z basis, and post-selecting the measurement outcome 1 leaves the unnormalized state |11⟩/√2.
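The effect of such a post-selection on a Bell state can be reproduced with a plain NumPy projection (a sketch of the semantics; note the result is deliberately left unnormalized):

```python
import numpy as np

# Bell state (|00> + |11>)/sqrt(2), amplitudes ordered |00>,|01>,|10>,|11>
bell = np.array([1.0, 0, 0, 1.0]) / np.sqrt(2)

# project qubit 0 onto outcome 1 (keep=1) WITHOUT renormalizing
P1 = np.kron(np.diag([0.0, 1.0]), np.eye(2))
post = P1 @ bell
print(post)  # [0, 0, 0, 1/sqrt(2)], i.e. the unnormalized state |11>/sqrt(2)
```
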
This post-selection scheme with unnormalized states is fast and can, for instance, be used to explore various quantum algorithms and nontrivial quantum physics such as measurement-induced entanglement phase transitions [56,57,58,59,60].

Pauli string expectation
Jupyter notebook: 6-2-pauli-string-expectation.ipynb
Minimizing the expectation values of sums of Pauli strings is a common task in quantum algorithms. For instance, VQE ground state preparation of an n-site transverse-field Ising model (TFIM), with Hamiltonian of the form given in Eq. (3) (coupling terms X_i X_{i+1} with weights J_i plus transverse-field terms), amounts to minimizing the energy expectation ⟨ψ(θ)| H |ψ(θ)⟩ with respect to the circuit parameters θ. TensorCircuit provides a number of ways to compute expressions of this form, useful in different scenarios. At a high level these are:
• Looping over terms, each computed by c.expectation_ps.
• Supplying a dense matrix, sparse matrix or MPO representation of the Hamiltonian to the operator_expectation function.
• Using vmap to compute each of the terms in a vectorized parallel manner, where each term is input as a structure vector.
Underlying these methods are a variety of ways of representing strings of Pauli operators and converting between these representations. Before going through the above methods in more detail, let us introduce the Pauli structure vector representation utilized in TensorCircuit .

Pauli structures and weights
A string of Pauli operators acting on n qubits can be represented as a length-n vector v ∈ {0, 1, 2, 3}^n, where v_i = j indicates Pauli operator σ^j acting on qubit i (with σ^0 = I, σ^1 = X, σ^2 = Y, σ^3 = Z). For example, in this notation, if n = 3 the term X_1 X_2 corresponds to v = [0, 1, 1]. We refer to such a vector representation of a Pauli string as a structure, and a list of structures, one for each Pauli string term in the Hamiltonian, is used as the input for computing sums of expectation values in a number of ways. If each structure has an associated weight, e.g. the term X_i X_{i+1} has weight J_i in Hamiltonian (3), then we define a corresponding tensor of weights, with one entry per structure. The operator op itself can be expressed in one of three forms: (i) dense matrix, (ii) sparse matrix, and (iii) Matrix Product Operator (MPO).
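The structure-vector convention can be made concrete with NumPy: the following sketch (not TensorCircuit's internal routine) turns a list of structures and their weights into the dense Hamiltonian matrix via Kronecker products:

```python
import numpy as np

I = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Y = np.array([[0.0, -1j], [1j, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])
paulis = [I, X, Y, Z]

def dense_from_structures(structures, weights):
    """Sum_k w_k * (sigma_{v_0} x ... x sigma_{v_{n-1}}) over structures v."""
    n = len(structures[0])
    h = np.zeros((2 ** n, 2 ** n), dtype=complex)
    for v, w in zip(structures, weights):
        term = np.array([[1.0]])
        for j in v:
            term = np.kron(term, paulis[j])  # tensor product site by site
        h += w * term
    return h

# n = 3: X_1 X_2 with weight 1.0 (structure [0, 1, 1]), Z_0 with weight 0.5
h = dense_from_structures([[0, 1, 1], [3, 0, 0]], [1.0, 0.5])
```
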
Dense Matrix Input. As a simple example, one can take a small Hamiltonian and supply its dense matrix elements by hand. TensorCircuit also provides a way of generating the matrix elements automatically from the associated Pauli structures and weights.

vmap over Pauli structures
Given a state |s⟩, the expectation value ⟨s| P |s⟩ of a Pauli string P with a given structure can be computed as a function of tensor inputs:

# assume the following are defined
# state: 2**n vector of coefficients corresponding to the input state
# structure: Pauli structure (i.e., a list of n integers in {0,1,2,3})

structure = tc.array_to_tensor(structure)

We benchmark these different approaches to Pauli string sum evaluation for Hamiltonians corresponding to an H2O molecule and the TFIM spin model, with results in Tables 3 and 4, respectively. See examples/vqeh2o_benchmark.py and examples/vqetfim_benchmark.py for more details. In the H2O case, we observe an acceleration of 85,000 times for VQE evaluation using the sparse matrix representation compared to a naïve loop (also implemented in TensorCircuit) as used in most other quantum software.
General remark on benchmark times: Throughout this work, when average times per circuit or gradient evaluation are reported (whether using TensorCircuit or other software), the initial compilation time associated with JIT is not included. In practice, we find that this initial compilation time is typically ten to several tens of times larger than the time required for subsequent evaluations, and its effect on the total algorithmic running time depends on how many iterations it is amortized over. If a circuit is only evaluated a few times, the time required for JIT compilation may outweigh the benefits it brings. However, in typical variational quantum algorithms, or in noise simulation based on Monte Carlo trajectories, circuits are evaluated a large number of times, and the amortized cost of JIT compilation is negligible.

vmap and vectorized_value_and_grad
Jupyter notebook: 6-3-vmap.ipynb
As we have seen in Sections 5.2.1 and 6.2.4, vmap allows batches of function evaluations to be performed simultaneously in parallel. If batch evaluation of gradients as well as function values is required, this can be done via vectorized_value_and_grad. In the simplest case, consider a function f(x, y) where x ∈ R^p and y ∈ R^q are both vectors, and one wishes to evaluate both f(x, y) and the gradient ∇_y f(x, y) = (∂f(x, y)/∂y_1, …, ∂f(x, y)/∂y_q) over a batch x_1, x_2, …, x_k of inputs x. This is achieved by creating a new, vectorized value-and-gradient function which takes as zeroth argument the batched inputs expressed as a k × p tensor, and as first argument the variables to differentiate with respect to. The outputs are a vector of function values evaluated at all points (x_i, y), and the gradient averaged over all those points. In the corresponding toy implementation, argnums indicates which argument derivatives are taken with respect to, and vectorized_argnums indicates which argument corresponds to the batched input. It is even possible to set argnums and vectorized_argnums to the same value, i.e., to batch compute over different initial values of the parameters being optimized. This can be useful, for example, in batched VQE computations (see Section 6.3.5).
Then, both the value f(ψ, w) and the gradient with respect to the weights, ∇_w f(ψ, w), can be evaluated for a batch of k input states |ψ_1⟩, …, |ψ_k⟩ simultaneously.
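The output convention described above, a vector of per-point values plus a gradient averaged over the batch, can be checked against a small NumPy finite-difference sketch (illustrative only; the real vectorized_value_and_grad uses automatic differentiation):

```python
import numpy as np

def f(x, y):
    # toy scalar function of two vectors
    return np.sum(x * y) ** 2

def batched_value_and_grad(f, xs, y, eps=1e-6):
    """Values at each (x_i, y) and the y-gradient averaged over the batch."""
    values = np.array([f(x, y) for x in xs])
    grads = []
    for x in xs:
        g = np.zeros_like(y)
        for j in range(len(y)):  # central finite differences w.r.t. y
            e = np.zeros_like(y)
            e[j] = eps
            g[j] = (f(x, y + e) - f(x, y - e)) / (2 * eps)
        grads.append(g)
    return values, np.mean(grads, axis=0)

xs = np.array([[1.0, 0.0], [0.0, 2.0]])  # batch of k = 2 inputs x
y = np.array([0.5, 1.0])
values, avg_grad = batched_value_and_grad(f, xs, y)
```
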

Batched circuits
Consider a family of parameterized quantum circuits U_1(w), …, U_k(w) acting on the same number of qubits, where each circuit U_j(w) can be expressed as a different parametrization of a single parent circuit, i.e. U_j(w) = U(w, x_j), where x_j is a vector of parameters. A loss function f is defined on these circuits, e.g. f(w, x_j) = ⟨0| U_j^† X_1 U_j |0⟩, and its gradient with respect to the weights w can also be batch evaluated across all circuits simultaneously. In this case, the batched inputs correspond to the first argument, and the variables to differentiate with respect to correspond to the zeroth argument:

f_vvg = K.vectorized_value_and_grad(f, argnums=0, vectorized_argnums=1)
f_vvg(weights, x_matrix)

Batched cost function evaluation
As discussed in Section 6.2.4, vmap can be used to accelerate the computation of expectations of sums of Pauli strings by making use of the Pauli structure vector representation. The same can be done for gradients of such expectation values using vectorized_value_and_grad. Consider a circuit on n = 3 qubits, with a cost function given by a sum of k = 4 Pauli terms. Batching over the k structure vectors v_i, one per Pauli term, returns the individual term expectations together with the gradient of their weighted sum.

Batched Machine Learning
In machine learning problems, one often wishes to perform batch computation over (data, label) pairs. This can be done by supplying a tuple of indices to the vectorized_argnums argument of vectorized_value_and_grad. Consider the following toy problem, where data vectors x ∈ [0, 1]^2 have labels y ∈ {0, 1} and one wishes to train the weights of a parameterized quantum circuit on a training set of (x, y) pairs. The x vectors are encoded in the angles of an initial set of parameterized gates, with the remaining weights w trained to minimize the cost function. In the f(x, y, w) cost function, x and y are the zeroth and first arguments (which we wish to batch compute over), while w is the second argument, with respect to which derivatives are taken. Using vectorized_value_and_grad therefore requires the vectorized_argnums argument to be the tuple of the x, y argument indices:

f_vvg = K.vectorized_value_and_grad(f, argnums=2, vectorized_argnums=(0, 1), has_aux=True)

The has_aux argument, if set to True, indicates that the function returns a tuple of which only the first element is differentiated. In our case, the first output is the loss function to be minimized, and the second, auxiliary output is the predicted label yp, which is also useful for computing other metrics such as AUC or ROC. The batch computation can then be performed by supplying the batched data tensor X (built via tc.array_to_tensor), the labels and the weights to f_vvg. While the above is a toy problem for illustrative purposes, combining this method with jit can be a powerful approach to finding ground state energies via VQE, starting from multiple initial points in parameter space and keeping the best local minimum reached by gradient descent. The batched VQE workflow, which admits independent optimization loops running simultaneously, is sketched in Figure 8.

Batched Monte Carlo trajectory noise simulation
As introduced in Section 5.2.1, by using external random number arrays, we can simulate quantum noise via multiple Monte Carlo trajectories computed in a vectorized parallel fashion. Each trajectory is a specific instance of the noisy circuit, where each quantum channel is replaced by a stochastically chosen operator determined by external random input. The following example shows how to vmap quantum noise (external randomness) in TensorCircuit in a variational quantum algorithm.

QuOperator and QuVector
Jupyter notebook: 6-4-quoperator.ipynb
tc.quantum.QuOperator, tc.quantum.QuVector and tc.quantum.QuAdjointVector are data classes that behave like matrices and vectors (columns or rows) when interacting with other objects, while internally they are represented as tensor networks for efficiency and compactness.
Typical tensor network structures for a QuOperator/QuVector correspond to Matrix Product Operators (MPO) / Matrix Product States (MPS), respectively. An MPO represents a matrix element-wise as M_{i_1…i_n, j_1…j_n} = T_1^{i_1 j_1} T_2^{i_2 j_2} ⋯ T_n^{i_n j_n}, where the T_k^{i_k j_k} are d × d matrices and d is known as the bond dimension. Similarly, an MPS represents a vector as v_{i_1…i_n} = T_1^{i_1} T_2^{i_2} ⋯ T_n^{i_n}, where the T_k^{i_k} are, again, d × d matrices. MPS and MPO occur frequently in computational quantum physics, as they give compact representations of certain classes of quantum states and operators. For an introductory review of MPS/MPO in quantum physics, please refer to [16].
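To make the MPS form concrete, the following NumPy sketch (illustrative only; QuVector handles this internally via the TensorNetwork package) contracts per-site tensors back into a dense vector, here for a GHZ state, which has bond dimension d = 2:

```python
import numpy as np

def mps_to_dense(tensors):
    """Contract an open-boundary MPS (site tensors of shape
    (d_left, 2, d_right)) into a dense length-2**n state vector."""
    v = tensors[0]
    for t in tensors[1:]:
        v = np.tensordot(v, t, axes=([-1], [0]))  # contract shared bond index
    return v.reshape(-1)

# GHZ state (|000> + |111>)/sqrt(2) as a bond-dimension d = 2 MPS
left = np.zeros((1, 2, 2))
left[0, 0, 0] = left[0, 1, 1] = 1.0
mid = np.zeros((2, 2, 2))
mid[0, 0, 0] = mid[1, 1, 1] = 1.0
right = np.zeros((2, 2, 1))
right[0, 0, 0] = right[1, 1, 0] = 1.0

ghz = mps_to_dense([left / np.sqrt(2), mid, right])
```

Storing the three site tensors requires far fewer elements than the dense 2^n amplitudes once n grows, which is the point of the O(nd^2) scaling discussed below.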
QuOperator/QuVector objects can represent any MPO/MPS, but they can additionally express more flexible tensor network structures. Indeed, any tensor network with two sets of dangling edges of the same dimension (i.e., for each k, the set {T_k^{i_k j_k}}_{i_k, j_k} of matrices has i_k and j_k running over the same index set) can be treated as a QuOperator. A general QuVector is even more flexible, in that the dangling edge dimensions can be chosen freely; thus, arbitrary tensor products of vectors can be represented.
In this section, we will illustrate the efficiency and compactness of such data structures, and show how they can be integrated seamlessly into quantum circuit simulation tasks. First, consider the following code and tensor diagram (Figure 9) as an introduction to these data class abstractions. Note that the convention in defining a QuOperator is to first state out_edges (left index or row index of the matrix) and then state in_edges (right index or column index of the matrix). Also note that tc.gates.Gate is just a wrapper for the Node object in the TensorNetwork package.
As seen above, QuOperator/QuVector objects support matrix-matrix and matrix-vector multiplication via the @ operator. Other common matrix/vector operations are also supported.

Figure 9: Tensor network schematic demonstrating the usage of QuOperator and QuVector. Each node is constructed from a tensor, and each QuOperator is built by specifying the dangling edges of nodes. From the user's perspective these objects behave as matrices and vectors, with the tensor network engine responsible for maintaining their inner structures and performing calculations.

QuVector as the input state for the circuit
Since a QuVector behaves like a regular vector, albeit with a more compact representation, it can be used as the input state of a quantum circuit in place of a state stored as a regular array. The benefit of doing so is memory efficiency. For an n-qubit circuit, regular vector inputs require 2^n complex values to be stored in memory, whereas an MPS with bond dimension d represented by a QuVector requires only O(nd^2) complex elements in total. Such compact MPS representations can be obtained, for instance, from DMRG [62] calculations, and a DMRG-ground-state-to-quantum-machine-learning pipeline can thus be built in TensorCircuit with the help of this feature.
The following example shows how the |111⟩ state, encoded as an MPS, is input to a quantum circuit. Note how the mps_inputs argument is used when constructing the circuit.
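The |111⟩ MPS of this example can be written down directly as three bond-dimension-1 tensors, each just the |1⟩ amplitude vector. The following NumPy sketch shows the data that the mps_inputs argument would consume (the exact TensorCircuit call signature is not reproduced here):

```python
import numpy as np

# |1> on each site as a (1, 2, 1) MPS tensor: bond dimension d = 1
one_site = np.array([0.0, 1.0]).reshape(1, 2, 1)
mps_tensors = [one_site] * 3  # only 3 * 2 = 6 stored elements

# contract the MPS back to a dense vector to check it equals |111>
state = mps_tensors[0]
for t in mps_tensors[1:]:
    state = np.tensordot(state, t, axes=([-1], [0]))
state = state.reshape(-1)
print(np.argmax(np.abs(state)))  # index 7, i.e. |111>
```
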

QuVector as the output state of the circuit
For a given input state, a tc.Circuit object can itself be treated as a tensor network with one set of dangling edges corresponding to the output state, and thus the whole circuit object can be regarded as a QuVector, obtained via c.quvector(). We can then further manipulate the circuit using QuOperator objects.

QuOperator as the operator to be measured
As shown in Section 6.2, Hamiltonians can also be represented by a QuOperator. This can be a powerful and efficient approach for computing expectation values of lattice model Hamiltonians such as the Heisenberg model or the transverse field Ising model (TFIM), as the MPO form of the Hamiltonian for such short-ranged spin models has a very low bond dimension (e.g., d = 3 for the TFIM). For comparison, for an n-qubit TFIM Hamiltonian, the dense matrix representation stores O(2^{2n}) complex elements and the sparse matrix representation stores O(n·2^n) complex elements, while the MPO or QuOperator representation stores only 18n complex elements, scaling linearly with system size.
Here we show a toy example of measuring the expectation value of an operator represented as a QuOperator. The quantity of interest is ⟨Z_0 Z_1⟩, where the expectation is taken with respect to the output of a simple circuit consisting of an X gate on one of the qubits. Instead of using the c.expectation() API, we use mpo_expectation.
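The expected value in this toy example can be verified with dense linear algebra (a NumPy sketch; the point of mpo_expectation is precisely to avoid ever building these dense objects):

```python
import numpy as np

Z = np.diag([1.0, -1.0])
X = np.array([[0.0, 1.0], [1.0, 0.0]])
I = np.eye(2)

# two-qubit circuit: X on qubit 0 applied to |00>, giving |10>
state = np.kron(X, I) @ np.eye(4)[0]

# dense stand-in for the Z_0 Z_1 operator (MPO in the real computation)
zz = np.kron(Z, Z)
expval = state.conj() @ zz @ state
print(expval)  # -1.0: the two qubits are anti-aligned
```
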

QuOperator as a quantum gate
Since quantum gates correspond to unitary matrices, an MPO representation of these matrices can allow significant space savings in some scenarios. A typical example is the multi-controlled gate, which admits a very compact MPO representation and can be characterized by a QuOperator in TensorCircuit. For an n-qubit gate with n − 1 control qubits, the dense matrix representation has 2^{2n} elements, while the MPO representation can be reduced to bond dimension d = 2, requiring only 16n elements in memory.
TensorCircuit has a built-in way to generate these efficient multi-controlled gates. The 0,2,1 arguments refer to the qubits the gate is applied to (the ordering matters, with the final index referring to the target qubit), the unitary argument defines the operation applied when all controls are activated, and the ctrl argument specifies whether each control is activated when the corresponding control qubit is in the 0 state or the 1 state. General MPO gates expressed as a QuOperator can also be applied via the c.mpo(*index, mpo=) API, in the same way that general unitary matrices can be applied via the c.unitary(*index, unitary=) API (see Section 3.2).

Custom contraction settings
Jupyter notebook: 6-5-custom-contraction.ipynb
By default, TensorCircuit uses the greedy tensor contraction path finder provided by the opt_einsum package. While this is typically satisfactory for moderately sized quantum circuits, for circuits with 16 qubits or more we recommend using customized contraction path finders provided by the user or by third-party packages.
A simple quantum circuit can be constructed as a testbed for different contraction methods using tc.templates.blocks.example_block, which creates a circuit with d layers of exp(iθZZ) gates and R_x gates. With reuse=False, the expectation is computed as a single tensor network contraction rather than by first computing the full wavefunction. When is_split is True, each two-qubit gate is not treated as an individual tensor but is split into two connected tensors via singular value decomposition (SVD), which further simplifies the tensor network structure of the corresponding circuit. The task is to calculate the expectation value of the Z operator on the middle (i.e., n/2-th) qubit.
The API for contraction setup is tc.set_contractor. In our example, 2n × d tensors need to be contracted, since single-qubit gates can be absorbed into two-qubit gates when preprocessing is set to True in set_contractor. Built-in contraction path finder options include "greedy", "branch" and "optimal" from opt_einsum [63], though only the default "greedy" option is suitable for circuit simulation tasks, as the other options require time exponential in the number of qubits. The contraction_info option in this setup API, if set to True, prints the contraction path information after the path search. Metrics that measure the quality of a contraction path include:
• FLOPs: the total number of computational operations required for all matrix multiplications involved when contracting the tensor network via the given path. This metric characterizes the total simulation time.
• WRITE: the total size (number of elements) of all tensors, including intermediate tensors, computed during the contraction.
• SIZE: the size of the largest intermediate tensor stored in memory.
Since simulations in TensorCircuit are AD-enabled, requiring all intermediate results to be cached and traced, the more relevant spatial cost metric is WRITE rather than SIZE.
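These path-quality metrics can be inspected on a toy network with NumPy's built-in einsum path finder, which plays the role here of a lightweight stand-in for the opt_einsum contractor used by TensorCircuit:

```python
import numpy as np

# a small matrix-chain 'network': (8x64) @ (64x64) @ (64x4)
a = np.random.rand(8, 64)
b = np.random.rand(64, 64)
c = np.random.rand(64, 4)

path, info = np.einsum_path("ij,jk,kl->il", a, b, c, optimize="greedy")
print(path)  # pairwise contraction order chosen by the greedy path finder
print(info)  # report includes FLOP count and largest intermediate size

# the returned path can be reused for the actual contraction
res = np.einsum("ij,jk,kl->il", a, b, c, optimize=path)
```
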

Customized contraction path finder
For large quantum circuits, the performance of the default "greedy" contraction path finder may not be satisfactory. In this case, a custom contraction path finder can be used to enhance contraction performance by finding better paths in terms of FLOPs (time) and WRITE (space). Here we use the path finder provided by the third-party cotengra package (import cotengra as ctg), a Python library for contracting tensor networks and computing einsum expressions. A number of parameters configure a contraction path finder opt in cotengra: method selects the strategy the path finder is based on; minimize selects the score function to be minimized during path finding, and can be set to "write", "flops", "size" or a combination of these; and a time limit and a limit on the number of trial contraction trees can be set via max_time and max_repeats, respectively. For more details, refer to the cotengra documentation. You can also design your own contraction path finder, as long as the opt function you provide is compatible with the opt_einsum optimizer interface.

Subtree reconfiguration
Given a contraction path, e.g. one found by a "greedy" search, its performance can be further enhanced by conducting a so-called subtree reconfiguration. This process repeatedly optimizes subtrees of the whole contraction tree, and in practice often results in a better contraction path. In TensorCircuit this is done by calling subtree_reconfigure_forest after a contraction tree has been found. In this function, one can set the number of trees in the random forest whose contraction paths will be updated, as well as the metric to be optimized within the subtrees. A user-customized function opt_reconf wrapping this procedure can then be fed into the contractor setup as a legal contraction path finder.
As mentioned earlier, there are three metrics that measure the quality of a contraction path. By setting different score functions (changing the minimize parameter), the contraction paths given by these contractors will exhibit different properties. Table 5 summarizes the contraction performance of different contraction strategies for our example case (n = 40, d = 6). As we can see, the cotengra optimizer and subtree reconfiguration can greatly improve the quality of the contraction path and the efficiency of quantum circuit simulation. For example, we gain more than a factor of two improvement in both simulation time and simulation space compared to the default contractor, and the degree of improvement can increase for larger system sizes.

Table 5: Performance of different contractor settings, including the default opt_einsum contractor and cotengra contractors with and without subtree reconfiguration. Parameters in brackets indicate the score function used during path searching and reconfiguration. For cotengra contractors, we set max_repeats=1024, max_time=120 and method=["greedy","kahypar"]. "combo" means the score function is a combination of "flops" and "write" (flops + 64 × write by default). For subtree reconfiguration, we set num_trees=20, num_restarts=20. Results shown are from one run; performance may vary from run to run since these algorithms are intrinsically random.

Advanced automatic differentiation
Jupyter notebook: 6-6-advanced-automatic-differentiation.ipynb
TensorCircuit provides backend-agnostic wrappers for a number of advanced AD features, useful in a variety of quantum circuit simulation scenarios. In the remainder of this section we illustrate these using a small example circuit with n = 6 qubits and nlayers = 3 layers. By the chain rule, the Jacobian matrix of a composition of functions is the product of the Jacobians of the composed functions (evaluated at appropriate points). E.g., if h : R^n → R^p, g : R^p → R^q, f : R^q → R^m and y : R^n → R^m with y(x) = f(g(h(x))), then J_y(x) = J_f(b) · J_g(a) · J_h(x), where a = h(x), b = g(a) and · denotes matrix multiplication. Forward mode AD and reverse mode AD ('backpropagation') are two approaches to computing a composite Jacobian, and differ in the order in which the products are computed. Forward mode AD computes the above product from right to left, i.e. J_y = J_f(b) · (J_g(a) · J_h(x)), at a cost of pqn + qnm scalar multiplications; reverse mode computes it from left to right. Taking the function of interest to be the output state ψ(θ) of the above circuit, the Jacobian can be computed in either mode. The relative efficiency of these methods depends on the input and output dimensions of the functions involved. For instance, when p = q, forward mode AD is advantageous if n ≪ m (i.e., the input dimension is much smaller than the output dimension, corresponding to a 'tall' Jacobian), and vice versa for reverse mode AD.
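The cost comparison above can be tabulated directly; this small sketch just evaluates the two multiplication-count formulas to show when each mode wins:

```python
def forward_cost(n, p, q, m):
    # right-to-left: (J_g . J_h) costs q*p*n, then J_f . (...) costs m*q*n
    return q * p * n + m * q * n

def reverse_cost(n, p, q, m):
    # left-to-right: (J_f . J_g) costs m*q*p, then (...) . J_h costs m*p*n
    return m * q * p + m * p * n

# few inputs, many outputs (n << m): forward mode is cheaper
print(forward_cost(2, 10, 10, 1000), reverse_cost(2, 10, 10, 1000))
# many inputs, few outputs (n >> m): reverse mode is cheaper
print(forward_cost(1000, 10, 10, 2), reverse_cost(1000, 10, 10, 2))
```
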
Jacobian-vector product (jvp). Computing the product of the Jacobian with a vector v (i.e. a directional derivative) is a useful primitive, as it utilizes forward mode AD and is suitable when the output dimension is much larger than the input dimension. For instance, setting v = e_i (i.e. the vector with a 1 in the i-th coordinate and zeros elsewhere) gives the vector of partial derivatives with respect to the i-th input parameter.

Quantum Fisher Information (qng). The Quantum Fisher Information (QFI) is an important concept in quantum information, and can be utilized in so-called quantum natural gradient descent optimization [64] as well as in variational quantum dynamics [65,66].
There are several variants of QFI-like quantities, all of which depend on the evaluation of terms of the form ⟨∂_i ψ|∂_j ψ⟩ − ⟨∂_i ψ|ψ⟩⟨ψ|∂_j ψ⟩. Such quantities are easily obtained with advanced AD frameworks, by first computing the Jacobian of the output state and then vmapping the inner product over Jacobian rows. The detailed, efficient implementation can be found in the codebase; here we directly call the corresponding API, which can also be jitted to make multiple evaluations more efficient:

from tensorcircuit.experimental import qng

# function to get the QFI given circuit parameters
qfi_fun = K.jit(qng(psi))

Hessian. Information on the Hessian matrix may be useful in investigating loss landscapes, or for second order optimization routines.
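The QFI-like term ⟨∂_i ψ|∂_j ψ⟩ − ⟨∂_i ψ|ψ⟩⟨ψ|∂_j ψ⟩ can be checked on a one-parameter toy state using finite differences (a NumPy sketch; qng obtains the same quantity with AD and vmap instead):

```python
import numpy as np

def psi(theta):
    # toy single-qubit state cos(theta)|0> + sin(theta)|1>
    return np.array([np.cos(theta), np.sin(theta)])

def qfi_term(theta, eps=1e-6):
    s = psi(theta)
    ds = (psi(theta + eps) - psi(theta - eps)) / (2 * eps)  # |d psi>
    return np.vdot(ds, ds) - np.vdot(ds, s) * np.vdot(s, ds)

print(qfi_term(0.3))  # 1.0 for this state, independent of theta
```
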
⟨ψ| H |∂ψ⟩. In variational quantum dynamics problems (see, e.g., [65]) one often wishes to compute quantities of the form ⟨ψ(θ)| H |∂_j ψ(θ)⟩. This can be done via the stop_gradient API, which prevents certain parameters from being differentiated. Again taking H = Z_0, we can define an appropriate function in which only the parameters corresponding to |ψ(θ)⟩ are differentiated. With this advanced automatic differentiation infrastructure, quantum circuit gradient related quantities, such as those listed above, can be obtained much more quickly than with traditional quantum software that evaluates gradients via parameter shifts. In Figure 10, we show the acceleration of QFI and Hessian computations compared to Qiskit. The benchmark code is detailed in examples/gradient_benchmark.py. From the data, we see that even for moderate-sized quantum circuits, TensorCircuit can achieve a speedup over Qiskit of nearly a million times.

Background. Quantum computing is envisioned as a powerful tool for quantum simulation and quantum chemistry tasks [67,68]. In the highlighted features below we show how to interface TensorCircuit with OpenFermion [69] to compute the ground state energy of an H2O molecule, while in the benchmarking section we compare the performance of TensorCircuit with other software for the VQE energy computation of a transverse field Ising model.

Highlighted features.
OpenFermion is an open-source Python package that provides an efficient interface between quantum chemistry and quantum computing. In particular, it provides a convenient method for generating molecular Hamiltonians and converting them into qubit Hamiltonians compatible with simulation on quantum circuits. To make use of this, we provide the tc.templates.chems.get_ps API, which interfaces TensorCircuit with OpenFermion and converts the OpenFermion qubit Hamiltonian object into the Pauli structures and weights tensors used by TensorCircuit (see Section 6.2.1). A short code snippet suffices to generate this Hamiltonian representation in TensorCircuit via OpenFermion.

Figure 10: Acceleration factor for TensorCircuit over Qiskit for the evaluation of QFI and Hessian information. For 4 and 6 qubit systems, we use a PQC comprising two blocks of CNOT+Rx+Rz+Rx gates, while for the larger systems we use four such blocks. The simulation runs on an AMD EPYC 7K62 CPU at 2.60 GHz, and TensorCircuit results use the JAX backend. As the Qiskit running times are long (e.g., the 12-qubit QFI calculation takes more than 20,000 s, about 5.5 hours), the acceleration factors presented are based on a single Qiskit evaluation. In contrast, the TensorCircuit running times are low enough (0.026 s for the 12-qubit QFI) that we average over multiple runs.

Benchmarking. We provide benchmark data for TFIM VQE evaluation here. Specifically, we compute the TFIM energy expectation value and the corresponding circuit gradients (with respect to the circuit parameters). With TensorCircuit, we use the (potentially less efficient) explicit for loop to perform the Pauli string summation (see Section 6.2.2) in order to obtain a fair comparison, with our benchmarking focus on the efficiency of evaluating the PQC and its gradients. See benchmarks for the benchmark setup and details. Results are summarized in Table 6. Note that while we consider the TFIM model in our benchmark tests, this only differs from molecular systems in the number and type of Pauli strings that appear in the Hamiltonian.
We do not perform similar benchmark tests with other common quantum simulator packages such as Qiskit, Cirq and ProjectQ, as these do not support automatic differentiation and instead rely on the parameter shift technique to evaluate gradients. As the computational complexity of this approach scales linearly with the number of variational parameters, it is inefficient for circuits with many parameters. For example, using Qiskit with its built-in parameter-shift gradient framework, the task in Table 6 (with n = 10, d = 3) takes 125 s per iteration, more than 10^5 times slower than TensorCircuit. Similar results were already presented in Section 6.6.

Quantum machine learning
Jupyter notebook: tutorials/mnist_qml.ipynb
Background. Quantum and hybrid quantum-classical neural networks are popular approaches to NISQ-era quantum computing, and both can be easily modelled and tested in TensorCircuit. In the highlighted features below, we illustrate how to build a hybrid machine learning pipeline in TensorCircuit, and in the benchmarking section we compare TensorCircuit with other quantum software for performing batched supervised learning on the MNIST dataset using a parameterized quantum circuit.
Highlighted features. Seamless integration of quantum and classical neural networks can be obtained by wrapping the TensorCircuit tc.Circuit object (with weights as input and expectation values as output) in a QuantumLayer, which is a subclass of the Keras Layer.

Benchmarking. We benchmark TensorCircuit against other software for binary classification ('3' vs. '6') on the MNIST dataset, using quantum machine learning with batched inputs. For software with only parameter-shift gradient support, each PQC must be evaluated O(n_p) times, where n_p is the number of circuit parameters. This is much slower than with AD-enabled simulators, where a single evaluation of the PQC suffices to obtain all circuit weight gradients. We therefore only compare TensorCircuit with other AD-enabled software, namely TensorFlow Quantum and Pennylane (with Pennylane, we utilize its JAX backend simulator, and use jit and vmap tricks to increase performance). The TensorCircuit results use the default greedy contraction path finder; further improvements are possible with customized contraction path finders. See benchmarks for full benchmark details and Table 7 for results. From the benchmarks on the standard VQE and QML task sets in the previous two examples, we see that TensorCircuit can indeed bring substantial speedups in quantum circuit simulation. The acceleration is more impressive on GPU, especially when the circuit size or the batch dimension is large.

Demonstration of barren plateaus
Jupyter notebook: tutorials/barren_plateaus.ipynb
Background. The so-called barren plateau phenomenon refers to the gradients of random circuits vanishing exponentially quickly as the number of qubits or the circuit depth increases [13]. Demonstrating this numerically requires computing the variance of circuit gradients over different circuit structures (i.e., choices of random gates in the circuit) and circuit weights. Gradients can be obtained via automatic differentiation, while the computation over different circuit weights can be vectorized and jitted to boost performance. In addition, as in this example, the computation over different circuit structures can also be vectorized and jitted to obtain a further speedup.
Highlighted features. This example showcases how jit and vmap can be applied to different circuit architectures. The ability to vmap circuit structures was introduced in Section 6.3.2. Here, we give more details on how to encode different circuit structures via a tensor parameter as input. The core part of the circuit construction is as follows.

The seeds tensor controls the circuit architecture via the unitary_kraus API, which stochastically attaches one gate from Rx, Ry, Rz, each with probability 1/3. The status argument externalizes the random number generation (see Section 5.2.1): when status is less than 1/3, the first gate in the list is applied; when status lies in [1/3, 2/3], the second gate is applied; and so on. Therefore, by generating a random number array seeds = K.implicit_randu(size=[n_qubits, n_layers]), we obtain a different random circuit for each draw. The simulation over different architectures can be vectorized and jitted by adding an extra batch dimension to seeds, corresponding to the number of different architectures to be computed in parallel.

Table 8: Gradient computation time in seconds for 100 random circuit structures (10 qubits, 10 layers each).

  Pennylane (CPU)    TensorFlow Quantum    TensorCircuit (CPU)    TensorCircuit (GPU)
  155.28             6.24                  0.12                   0.011
Benchmarking. We benchmark the computation of the gradient variance over different random circuit weights and circuit structures using TensorFlow Quantum, Pennylane and TensorCircuit (see examples/bp_benchmark.py for details). The gradient computation times for 100 random circuit constructions, each a 10-qubit, 10-layer circuit, are shown in Table 8. With 10 × 10 circuit weights and 100 different circuits, the combination of vmap and jit gives TensorCircuit a more than five-hundred-fold speedup over TensorFlow Quantum on this task. Since Pennylane cannot batch the computation over trainable parameters and circuit structures, its gradient evaluations must be performed sequentially in a plain loop, which leads to significantly longer run times even on the fastest Pennylane-Lightning backend.
The choice of benchmark problem sets: The previous three subsections gave benchmark results for the VQE, QML and barren plateau problems, respectively. These three tasks were chosen to showcase and benchmark the paradigms and features that quantum software can support: VQE problems benefit from AD; QML problems benefit from AD and from vmap over non-differentiable inputs; and the barren plateau problem is accelerated by AD as well as by vmap over differentiable inputs and over circuit structures. In all cases, jit and GPU support (where compatible with AD and vmap) can further increase the efficiency of the corresponding quantum simulations.

Table 9: Performance benchmarks (running time in seconds) for a large-scale VQE task with different qubit counts. The TensorCircuit results use the TensorFlow backend and run on an Nvidia A100 40G GPU. Note that the VQE optimization hyperparameters were not tuned, so the energy accuracy obtained represents only a lower bound on the performance of the current method. The time for one step includes computing both the energy expectation value and the circuit gradients. The full optimization requires thousands of iterations, with the Adam gradient descent optimizer used initially, followed by an SGD optimizer in the later stages of the process.

Outlook and concluding remarks
We have introduced TensorCircuit, an open-source Python package designed to meet the requirements of larger and more complex quantum computing simulations. TensorCircuit is built on top of, and incorporates, the main engineering paradigms of modern machine learning libraries, and its flexible and customizable tensor network engine enables high-performance circuit computation.
Outlook. We will continue the development of TensorCircuit towards a more efficient, elegant, full-featured and ML-compatible quantum software package. With the continued rapid progress in quantum computing theory and hardware, our hope is that TensorCircuit, a quantum simulator platform designed for the NISQ era, will play an important role in the academic and commercial progress of this exciting field.