Representation of binary classification trees with binary features by quantum circuits

We propose a quantum representation of binary classification trees with binary features based on a probabilistic approach. By using the quantum computer as a processor for probability distributions, a probabilistic traversal of the decision tree can be realized via measurements of a quantum circuit. We describe how tree inductions and the prediction of class labels of query data can be integrated into this framework. An on-demand sampling method enables predictions with a constant number of classical memory slots, independent of the tree depth. We experimentally study our approach using both a quantum computing simulator and actual IBM quantum hardware. To our knowledge, this is the first realization of a decision tree classifier on a quantum device.


Introduction
Decision trees are well-known predictive models commonly used in data mining and machine learning for a wide range of applications [1][2][3]. In general, a decision tree can be viewed as a flowchart-like structure that can be used to query data. Starting from the root, each internal node represents a test on the query data and each outgoing branch represents a possible outcome of this test. For a binary tree, the test result is a Boolean value and can therefore be either true or false (i. e., there are two branches from each internal node). Each leaf of the tree can be associated with a decision. Therefore, the path from the root to the leaf implies a set of decision rules for the query data in the sense of a sequential decision process. In particular, we consider binary classification trees where the decision of a leaf determines the membership of a data point to a predefined discrete set of classes.
The inference of a decision tree from a given data set is a supervised machine learning task also known as decision tree induction (or decision tree learning). However, finding a globally optimal solution is NP-hard [4,5] and therefore heuristic recursion algorithms are favored in practice [6]. Such algorithms typically work in a greedy top-down fashion [7]: Starting from the root, the best test is estimated for each internal node by minimizing a data impurity function. The data set is split accordingly into two subsets along each of the two outgoing branches. This process is repeated recursively for each internal node until a stopping criterion terminates the tree traversal and results in a leaf with a classification decision based on the majority class present in the data subset within the node. The algorithm ends when all paths lead to a leaf. A heuristically created decision tree is not guaranteed to be globally optimal but might still be suitable for practical purposes.
In the context of quantum computing, decision trees can be assigned to the field of quantum machine learning [8]. Several previous papers consider the interplay between decision trees and quantum computing. In [9], the traversal speed of decision trees is studied and classical and quantum approaches are compared. The authors find no benefit of one over the other. A heuristic algorithm to induce quantum classification trees is presented in [10], where data points are encoded as quantum states and measurements are used to find the best splits. However, some parts of the hybrid quantum-classical algorithm seem to be incomplete and can be considered "enigmatic" [11]. In [12], a conceptual quantum extension is presented for the classical decision tree induction algorithm C5.0 [2] such that an almost quadratic speed-up can be gained by implementing a suitable quantum subroutine. However, there is no publication that we know of that considers both induction and evaluation of a quantum decision tree in a consistent manner. In this sense, "the potential of a quantum decision tree is still to be established" [11].
Decision trees have also been used to realize efficient data management strategies in a quantum computing context. In [13], a "bucket-brigade" quantum random access memory (qRAM) is proposed, which achieves an exponential reduction in access overhead compared to a conventional classical random access memory design. In this approach, three-level memory elements are structured in a binary tree to organize the flow of information to the memory cells in the leaves (and back). Moreover, a whole ensemble of decision trees can also be used to store structured data. In [14], binary decision trees are used as building blocks for a data structure to store a preference matrix for a quantum recommendation system. To this end, the preference matrix is decomposed into row vectors, which are then stored in a binary tree such that each leaf holds the amplitude of one vector component and each internal node stores the respective amplitude sums of its branches, which enables efficient computations.
The term "quantum decision tree" is also used in the context of query complexity analysis of quantum algorithms in analogy to the decision tree view of classical query complexity. Here, the quantum algorithm is treated as a sequence of unitary operators followed by a measurement [15][16][17][18]. However, such a configuration has a different purpose than a decision tree in the sense of a predictive model, which is why the term "quantum query algorithm" is also used instead. On the other hand, a classical decision tree query can in principle be expressed by a quantum query algorithm.
In this manuscript, we follow a different strategy and propose a quantum representation of decision trees, which we call Q-tree, by using the quantum computer as a processor for probability distributions. For this purpose, we first take a probabilistic perspective to describe the decisions in a (classical) decision tree as conditions of the probability distribution of the training data. This approach allows us to associate a marginalized conditional probability distribution to each leaf that describes the probability distribution to reach the leaf in a probabilistic traversal of the tree and the corresponding class label probability distribution, respectively. In this sense, we load a classical decision tree into a quantum circuit.
The set of conditions in the tree can also be parameterized to obtain a tuple of vectors that uniquely define the structure of the tree. Based on these parameters, an appropriate quantum circuit can be built that fully represents the probabilistic model of the tree. By performing repeated measurements, this Q-tree can then be used for a probabilistic traversal of the tree in order to sample from the probability distribution of a randomly reached leaf. This approach enables a truly random representation of the tree [19,20]. The results from such a random sampling can be used for a prediction of the class labels of query data. A similar, but slightly modified circuit is also able to provide predictions for uncertain query data. Moreover, a hybrid quantum-classical algorithm can be used for a tree induction by optimizing the corresponding parameters. We propose the usage of a genetic algorithm [21] for this purpose.
In short, the conceptual foundation for our approach is a straightforward processing of probability distributions in which we utilize the quantum computer to perform conditional marginalizations based on the parameterized tree structure. Related concepts have already been pursued, e. g., in [22][23][24] in the context of Bayesian networks.
The remainder of the paper is structured as follows. We start in Sec. 2 by describing the considered machine learning problem. In Sec. 3, we discuss preliminary considerations regarding purely classical decision trees, which are used in Sec. 4 to present our proposed Q-tree quantum representation. Subsequently, we demonstrate the usage of this quantum representation in Sec. 5 on a quantum computing simulator and actual quantum hardware. We close with conclusions and an outlook in Sec. 6.
Figure 1: Schematic sketch of a binary classification tree with binary features. Starting from the root, each path along the internal nodes to a leaf represents a set of decision rules for the data features in the form of Eq. (3) to predict the corresponding labels (i. e., to classify the query data). The outgoing branches represent the two cases $x_i = 0$ (denoted by 0) and $x_i = 1$ (denoted by 1), respectively. The depth d indicates the number of decision tests required to reach a node. We only consider trees with no repeating decision rules along a path and no missing internal nodes. The whole tree (of maximum depth 3 in this example) is denoted by T.

Problem description
We consider data points of the form d ≡ (x, y) with k-dimensional binary input features $x \in B^k$ and m-dimensional binary (class) labels $y \in B^m$ with B ≡ {0, 1}. Based on a given training data set $D_{\mathrm{train}} \equiv \{d^1, \dots, d^t\}$ (1) consisting of t data points, a decision tree T can be induced using an algorithm A such that $D_{\mathrm{train}}, A \to T$. (2) The algorithm A can for example represent a global optimization method or a heuristic approach as discussed in Sec. 1.
We limit ourselves to binary classification trees with binary features T, which are also known as binary decision diagrams [25]. Starting from the root, such a tree consists of a set of nodes connected with branches, where each node has exactly one branch going in. Internal nodes have two branches going out, whereas leaves have no outgoing branches. In each internal node, the test is performed for a specific feature i ∈ {1, . . . , k} and the two outgoing branches represent the rules $x_i = 1$ (i. e., test passed) and $x_i = 0$ (i. e., test failed), respectively. Each node can consequently be associated with a data set representing a subset of the training data that fulfills the decision rules of the path from the root to the respective node. The depth d of a node indicates the number of decision tests required to reach it, where the root corresponds to d = 0. For reasons of simplicity, we limit ourselves to trees with the following two properties:
1. A feature cannot be split twice along a path, i. e., decision rules must not repeat.
2. All leaves have the same depth, i. e., the tree is not missing any internal nodes.
The first property implies that d < k such that in a tree of depth k − 1, all features have to be evaluated once to reach a leaf. At depth d, there are $2^d$ nodes, each with k − d possible splitting decisions, which in total sum up to $\sum_{l=0}^{d} 2^l (k - l)$ possible decisions along all paths up to this depth. A schematic tree is sketched in Fig. 1.
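As a quick check of the counting argument, the following minimal Python sketch evaluates the sum of possible splitting decisions for given k and d. The function name `count_split_choices` is hypothetical and only serves as an illustration.

```python
def count_split_choices(k: int, d: int) -> int:
    """Number of possible splitting decisions along all paths up to depth d.

    At depth l there are 2**l nodes, each with k - l remaining features,
    so the total is the sum over l = 0, ..., d of 2**l * (k - l).
    """
    if not 0 <= d < k:
        raise ValueError("the depth must satisfy 0 <= d < k")
    return sum(2**l * (k - l) for l in range(d + 1))


if __name__ == "__main__":
    # Three features, splits at depths 0, 1, and 2: 1*3 + 2*2 + 4*1
    print(count_split_choices(3, 2))
```

The guard reflects the restriction d < k that follows from forbidding repeated splits along a path.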
After its induction, a decision tree T of this form can subsequently be used to predict the labels y for a query vector of input features x (i. e., it can be used to classify the query data), which is in general not part of the training data set. However, instead of a crisp prediction value, each leaf of the tree T is labeled with the probabilities of predicting either the label $y_i = 0$ or the label $y_i = 1$ for all i ∈ {1, . . . , m} as given by the corresponding ratios of labels in the associated subset of the training data. In this sense, we consider probability estimation trees that can also be used to make fuzzy predictions [26]. Specifically, the tree T can predict the label probability distribution p(y) ∈ [0, 1] from features x according to $x, T \to p(y)$, (4) which also allows us to predict the most probable labels $\arg\max_y p(y) \in B^m$. It can be assumed that all data points from the training data set and all query data points are sampled from an underlying "true" (but usually unknown) probability distribution in the sense that $d \sim \tilde{p}(x_1, \dots, x_k, y_1, \dots, y_m) \equiv \tilde{p}(x, y)$ (5) for each data point. The induction of a decision tree therefore corresponds to finding a representative estimation for this probability distribution based on the information contained in the training data set.

Classical representation
In this section, preliminary considerations regarding purely classical decision trees are presented. First, we describe how a decision tree can be parameterized by a tuple of vectors. This parameterization can then be used for the tree induction. Subsequently, we discuss the probabilistic view of decision trees, in which a marginalized conditional probability distribution can be assigned to each node based on the training data. This formulation can be directly realized with a quantum circuit and is therefore of central importance for our proposed Q-tree approach. Finally, we briefly explain how the prediction of query data fits into the probabilistic framework.

Tree parameterization
A binary decision tree, as shown in Fig. 1, has a highly symmetric structure, uniquely defined by the feature indices for which a splitting decision is performed in each internal node. We therefore present two possible tree parameterizations in the following, which are particularly useful for the subsequent considerations.

Decision configuration
The structure of a tree T can be parameterized as a tuple of vectors defining the tests in each node. For this purpose, we introduce the decision configuration $E \equiv (e^0, \dots, e^d)$ (6) for a tree of maximum depth d as a (d + 1)-tuple of layer (i. e., depth-layer) decisions $e^i \equiv (e^i_1, \dots, e^i_{2^i})$. (7) The vector $e^i$ represents all decisions in depth i, which are defined by the feature indices $e^i_j$ for which a split is made, one for each node $j \in \{1, \dots, 2^i\}$ in this depth, sorted in a specific order.
As a convenient sorting order, we propose that given a bit string $\bar{x}_0 \cdots \bar{x}_{l-1}$ of split decisions from the root (i. e., the single node with depth 0) to a node of depth l ≥ 1 corresponding to the decision rules $x_{e^0} = \bar{x}_0 \wedge \cdots \wedge x_{e^{l-1}} = \bar{x}_{l-1}$, the decision in this node is determined by $e^l_\nu$ with the node index ν, Eq. (8), so that $\nu \in \{1, \dots, 2^l\}$. Here we assume that a valid sequence of decisions $\{e^0, \dots, e^{l-1}\}$ (9) has been chosen (in the sense that this path is actually present in the tree T). For the root, we define $\nu_0 \equiv 1$. Thus, any node in the tree T can be identified by its depth l ∈ {0, . . . , d} and an index ν.
Figure 2: Tree parameterization E, Eq. (6), for the tree from Fig. 1. For each internal node, we show the features for which the respective splitting test, Eq. (3), is performed. In addition, we also show the bit strings of split decisions $\bar{x}_0 \bar{x}_1 \bar{x}_2$ for the leaves and the corresponding node index ν, Eq. (8). Any node in the tree can be uniquely identified by such a bit string corresponding to the decision rules along the path from the root to this node.
In a valid tree, all feature indices $e^i_j$ must be chosen so that no repeated splits occur along any path. For a tree of depth d, this means that the condition in Eq. (10) has to be fulfilled by the corresponding decision configuration E. In summary, the decision configuration E uniquely defines the structure of the tree T as an ordered nested sequence of $\sum_{l=0}^{d} 2^l$ feature indices in total. In addition to its structural information, T also contains information about the training data $D_{\mathrm{train}}$, Eq. (1), since each node can be associated with the subset of training data that fulfills the corresponding decision rules, which is particularly used in the leaves for the inference of the labels of query data. Combining E and $D_{\mathrm{train}}$, a one-to-one correspondence with T, Eq. (11), can be established. An example is sketched in Fig. 2.
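The validity condition on a decision configuration can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the node ordering reads the bit string of split outcomes as a binary number with the first split as the most significant bit, which is one possible convention and not necessarily the one of Eq. (8), and the names `node_index` and `is_valid_configuration` are hypothetical.

```python
from itertools import product


def node_index(bits):
    """Map the split outcomes along a path to a 1-based node index.

    Assumed ordering: the first split is the most significant bit.
    The empty bit string (the root) yields the index 1.
    """
    nu = 0
    for b in bits:
        nu = 2 * nu + b
    return nu + 1


def is_valid_configuration(E, k):
    """Check that no feature index is tested twice along any path.

    E[l][nu - 1] is the feature index (1..k) tested in node nu of depth l.
    """
    if not E:
        return False
    for l, layer in enumerate(E):
        if len(layer) != 2**l or any(not 1 <= e <= k for e in layer):
            return False
    depth = len(E)
    # Every bit string of length depth - 1 identifies one deepest internal node.
    for bits in product((0, 1), repeat=depth - 1):
        path = [E[l][node_index(bits[:l]) - 1] for l in range(depth)]
        if len(set(path)) != len(path):
            return False
    return True
```

For example, with k = 3 the configuration `[[1], [2, 3], [3, 3, 2, 2]]` passes the check, whereas `[[1], [1, 2]]` fails because the feature 1 would be tested twice along one path.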

Compressed decision configuration
In each layer of the tree, the sequence of decisions, Eq. (8), grows by one feature, which is equivalent to reducing the available number of split choices by one. However, this is not directly reflected by the encoding E, Eq. (6), in which the feature indices $e^i_j$ can attain k possible values independent of the depth. We fix this defect by imposing an additional condition given by Eq. (10). It states that only such values of $e^i_j$ are allowed that produce a tree without repeated splits. Consequently, additional dependencies between the elements of E emerge.
In the context of this work, it turns out to be more practical to switch to a dependency-free parameterization, which naturally reflects the reduction of choices by having fewer degrees of freedom in each layer. For this purpose, we introduce the compressed decision configuration $C \equiv (c^0, \dots, c^d)$ (12) for a tree of maximum depth d as a (d + 1)-tuple of independent vectors $c^i$, Eq. (13), in analogy to Eqs. (6) and (7). The elements $c^i_j$ (with $j \in \{1, \dots, 2^i\}$) are mutually independent and correspond to a valid tree without the need for additional constraints. There exists a bijective transformation between Eqs. (6) and (12), Eq. (14), as we show in App. A such that a decision tree T can be parameterized by either E, Eqs. (6) and (10), or C, Eq. (12). We consequently use E and C interchangeably.

Tree induction
The relations in Eqs. (11) and (14) allow us to rephrase Eq. (2) as $D_{\mathrm{train}}, A \to C$ (15) in the sense that the induction of a tree T based on the training data $D_{\mathrm{train}}$, Eq. (1), and an algorithm A can equivalently be considered as an induction of the corresponding compressed decision configuration C, Eq. (12). As an example for A, which is particularly useful for our quantum decision tree approach, we propose a genetic algorithm [21,27], which we briefly discuss in the following.
In general, a genetic algorithm is a gradient-free optimization strategy inspired by natural selection in which a population of candidate solutions is iteratively modified to find the best solution (among the set of evaluated candidates) that maximizes a given fitness function. Typically, a genetic algorithm is defined by three components:
1. A chromosome representation of the solution domain in the sense that each candidate can be uniquely identified by its chromosome.
2. A set of genetic operators to alter the population of candidate solutions (typically in the form of selection, crossover, and mutation).
3. A fitness function, which can be evaluated for each candidate solution.
The resulting solution is only the best solution from the set of evaluated candidates and it cannot be generally guaranteed that it is also globally optimal. In our case, the goal of the genetic algorithm is to find a decision tree T of a predefined maximal depth d as its solution. Therefore, a possible chromosome representation of a candidate solution is given by the vector consisting of the swap indices $c^i_j$ of the corresponding compressed decision configuration C, Eq. (12). The elements of C, which are also called chromosome attributes, are mutually independent and constrained by lower and upper bounds as defined in Eq. (13). These limits define the solution domain in such a way that C can represent any possible tree of maximal depth d. Furthermore, the fitness function can be any map that assigns a real number to each candidate solution and should be chosen in such a way that ordering solutions by their "quality" corresponds to ordering them by their fitness. A genetic algorithm with these ingredients can be used to find an approximate solution, which depends on the hyperparameters θ of the algorithm. In App. B, we propose a concrete realization based on this conceptual framework.
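The three components can be sketched in Python as follows. This is a minimal illustration and not the realization from App. B: the gene bounds assume that a swap index in layer i takes values in {i + 1, . . . , k}, the genetic operators (tournament selection, uniform crossover, uniform mutation) are generic textbook choices, and all names and default hyperparameters are hypothetical.

```python
import random


def gene_bounds(k, d):
    """Assumed bounds per gene: layer i holds 2**i genes in {i + 1, ..., k}."""
    return [(i + 1, k) for i in range(d + 1) for _ in range(2**i)]


def genetic_search(fitness, bounds, pop_size=20, generations=30, p_mut=0.1, seed=0):
    """Return the fittest chromosome among all evaluated candidates."""
    rng = random.Random(seed)

    def random_chromosome():
        return [rng.randint(lo, hi) for lo, hi in bounds]

    population = [random_chromosome() for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            # Selection: two tournaments of size two.
            a = max(rng.sample(population, 2), key=fitness)
            b = max(rng.sample(population, 2), key=fitness)
            # Crossover: each gene is taken from one of the two parents.
            child = [rng.choice(pair) for pair in zip(a, b)]
            # Mutation: redraw a gene within its bounds with probability p_mut.
            child = [
                rng.randint(lo, hi) if rng.random() < p_mut else gene
                for gene, (lo, hi) in zip(child, bounds)
            ]
            offspring.append(child)
        population = offspring
        best = max(population + [best], key=fitness)
    return best
```

In an actual tree induction, the fitness function would score the tree defined by a chromosome on the training data. A toy call such as `genetic_search(lambda c: -sum(c), gene_bounds(4, 1))` only demonstrates the mechanics.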

Probabilistic view
Since the training data $D_{\mathrm{train}}$, Eq. (1), is sampled from a probability distribution $\tilde{p}(x, y)$, Eq. (5), we can conversely use the probability distribution $p(x, y) \equiv |\{d \in D_{\mathrm{train}} : d = (x, y)\}| / |D_{\mathrm{train}}|$ (20) based on a histogram of the samples to estimate $p(x, y) \approx \tilde{p}(x, y)$, (21) where | · | denotes the set cardinality. This estimated probability distribution can be used to establish a probabilistic view of a decision tree by switching from assigning subsets of the data to each node to assigning the corresponding histogram-based probability distributions. However, instead of recounting the histogram for each node, we follow a mathematically equivalent path by successively applying Bayes' rule to construct conditional probability distributions according to the decision rules, as described in the following. A related discussion regarding decision trees and probability distributions can also be found in [28]. Given a tree T, we can obtain the corresponding training data $D_{\mathrm{train}}$ and decision configuration E according to Eq. (11). The root (where no previously applied decision rules are present) can be directly associated with p(x, y). The corresponding label probability distribution $p(y) = \sum_x p(x, y)$ (22) can be found by marginalizing over all features x. Here and in the following, a sum over binary variables iterates over B for each variable.
To obtain the probability distributions associated with deeper nodes, the splitting decisions have to be taken into account. For example, the first split of the tree with respect to the feature $e^0_1 \in \{1, \dots, k\}$ corresponds to a conditioning on $x_{e^0_1}$ so that the probability distributions assigned to the two branched nodes read $p(x_{\setminus\{e^0_1\}}, y \mid x_{e^0_1} = 0)$ (23) and $p(x_{\setminus\{e^0_1\}}, y \mid x_{e^0_1} = 1)$, (24) respectively. In analogy to Eq. (22), the corresponding label probability distributions are given by $p(y \mid x_{e^0_1} = 0) = \sum_{x_{\setminus\{e^0_1\}}} p(x_{\setminus\{e^0_1\}}, y \mid x_{e^0_1} = 0)$ (25) and $p(y \mid x_{e^0_1} = 1) = \sum_{x_{\setminus\{e^0_1\}}} p(x_{\setminus\{e^0_1\}}, y \mid x_{e^0_1} = 1)$, (26) respectively, where we have made use of the abbreviation $x_{\setminus S}$, Eq. (27), for the features whose indices are not contained in a set of indices S ⊆ {1, . . . , k}. Here and in the following, sets as arguments of a probability distribution are to be understood in such a way that the corresponding elements are used as arguments to this probability distribution in an appropriate order. Any node of depth l in the tree T can be identified by its index ν, Eq. (8). Instead of using the index, the node can also be identified by a set of conditions $C_\nu \equiv \{x_{e^0} = \bar{x}_0, \dots, x_{e^{l-1}} = \bar{x}_{l-1}\}$ (28) with feature values $\bar{x}_i \in B$ for all i ∈ {0, . . . , l − 1} by following the corresponding sequence of decisions $\{e^0, \dots, e^{l-1}\}$, Eq. (9). Consequently, any node can be associated with a marginalized conditional probability distribution $p(x_{\setminus\{e^0, \dots, e^{l-1}\}}, y \mid C_\nu) = p(x_{\setminus\{e^0, \dots, e^{l-1}\}}, C_\nu, y) / p(C_\nu)$ (29) with the probability distribution of the decision set $p(C_\nu) = \sum_{x_{\setminus\{e^0, \dots, e^{l-1}\}}, y} p(x_{\setminus\{e^0, \dots, e^{l-1}\}}, C_\nu, y)$ (30) and the corresponding label probability distribution $p(y \mid C_\nu) = \sum_{x_{\setminus\{e^0, \dots, e^{l-1}\}}} p(x_{\setminus\{e^0, \dots, e^{l-1}\}}, y \mid C_\nu)$, (31) which can also be marginalized to get the probability distribution $p(y_i \mid C_\nu)$, Eq. (32), of the ith label with i ∈ {1, . . . , m}. Thus, we have established a probabilistic relation between the training data $D_{\mathrm{train}}$ and the tree structure E, Eq. (6). A probabilistic tree traversal corresponds to drawing a sample $\bar{x} \in B^k$ from $p(x) = \sum_y p(x, y)$, (33) which is based on Eq. (21), and subsequently using the decisions $C_\nu$ in the tree T to reach a certain node. The probability distribution of reaching this node is then given by $p(C_\nu)$ and the corresponding label probability by $p(y \mid C_\nu)$.
Equivalently, the probabilistic tree traversal can also be understood as drawing samples $\bar{x} \in B^k$ and $\bar{y} \in B^m$ directly from the joint distribution in Eq. (34), which allows us to infer $p(C_\nu)$ and $p(y = \bar{y} \mid C_\nu)$, respectively.
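The probabilistic view can be made concrete with a small classical Python sketch. It builds the histogram-based estimate of the joint distribution and then derives the probability of a set of conditions and the conditional label distribution by marginalization and division. The function names and the representation of conditions as a dictionary are assumptions made for this illustration only.

```python
from collections import Counter
from fractions import Fraction


def estimate_joint(data):
    """Histogram estimate p(x, y) from a list of (x, y) tuples of bits."""
    counts = Counter(data)
    total = len(data)
    return {point: Fraction(n, total) for point, n in counts.items()}


def leaf_distributions(p, conditions):
    """Return p(C) and the label distribution p(y | C).

    conditions maps a 1-based feature index to the required feature value.
    If p(C) is zero, the conditional distribution is undefined (None).
    """
    joint = {}
    for (x, y), prob in p.items():
        if all(x[i - 1] == value for i, value in conditions.items()):
            # Marginalize over all features that are not part of the conditions.
            joint[y] = joint.get(y, Fraction(0)) + prob
    p_c = sum(joint.values(), Fraction(0))
    if p_c == 0:
        return p_c, None
    return p_c, {y: prob / p_c for y, prob in joint.items()}
```

Exact fractions are used so that the result agrees with directly recounting the histogram in the corresponding data subset.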

Tree predictions
Given a tree T of maximal depth d, the prediction of query data $x^q \in B^k$ in the sense of Eq. (4) can be performed by an iterative traversal of the tree. Starting from the root, we denote the index of the traversed node of depth 1 (after the first split with feature index $e^0_1$) by $q_1$, the index of the subsequently traversed node of depth 2 (after the second split with feature index $e^1_{q_1}$) by $q_2$, and so on, where each index is determined by the values of the tested query features via Eq. (8). Generally, the index of the traversed node of depth l > 1 is denoted by $q_l$, until for l = d the leaf with index $q_d$ is reached. The traversed path imposes a set of conditions according to Eq. (28). Consequently, the predicted probability distribution for the ith label with i ∈ {1, . . . , m} is given by the label probability distribution, Eq. (32), under these conditions. This expression also allows us to infer the most probable label $\hat{y}_i$.
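The iterative traversal can be sketched in a few lines of Python. As an assumption for this illustration, the child of node ν that is reached with the outcome b receives the index 2(ν − 1) + b + 1, which is one possible ordering convention and not necessarily the one of Eq. (8). The function name `traverse` is hypothetical.

```python
def traverse(E, x_query):
    """Follow the decisions of a tree for one query vector.

    E[l][nu - 1] is the 1-based feature index tested in node nu of depth l.
    Returns the index of the reached node in the next layer and the set of
    conditions imposed by the path as a dict {feature index: feature value}.
    """
    nu = 1
    conditions = {}
    for layer in E:
        feature = layer[nu - 1]
        bit = x_query[feature - 1]
        conditions[feature] = bit
        # Assumed ordering of children: outcome 0 first, then outcome 1.
        nu = 2 * (nu - 1) + bit + 1
    return nu, conditions
```

The returned conditions identify the leaf whose label probability distribution serves as the prediction for the query.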

Quantum representation
So far, our considerations have been completely classical. In this section, we propose Q-trees as a quantum representation of binary classification trees with binary features using quantum circuits [29]. Our considerations are mainly based on the probabilistic perspective on decision trees from Sec. 3.3. Specifically, we presume that there is a tree T of maximum depth d, which is uniquely defined by the training data $D_{\mathrm{train}}$, Eq. (1), and the compressed decision configuration C, Eq. (12). Any leaf of T can be associated with a probability distribution, Eq. (29), that describes the probability distribution of $D_{\mathrm{train}}$, Eq. (21), conditioned on the decisions along the path from the root to the leaf. Based on this perspective, we can construct a quantum circuit that performs a random traversal of the tree in the sense that it samples the label probability $p(y \mid C_\nu)$, Eq. (31), of a random leaf that is selected with probability $p(C_\nu)$, where $C_\nu$, Eq. (28), denotes the set of decisions that identifies a path in the tree. We achieve this by preparing the probability distribution p(x, y), Eq. (21), in terms of qubit amplitudes and subsequently applying a set of (conditional) SWAP gates to alter the amplitudes according to the tree structure. The final projective measurement over a subset of the qubits then yields a sample from $p(C_\nu, y = \bar{y})$, Eq. (34).
In the following, we explain this approach in more detail. We start by describing the corresponding quantum circuit and how it can be constructed from a given tree. In the subsequent part, we briefly discuss how tree inductions and label predictions of query data can be performed within this framework.

Tree circuit
A Q-tree representing the classical tree T can be realized by using the circuit layout sketched in Fig. 3. This Q-tree circuit, which we also refer to as Q-tree for short, contains k + m qubits in total. The first k qubits (denoted by $|x_1\rangle, \dots, |x_k\rangle$ or their respective indices 1, . . . , k) represent the features, whereas the next m qubits (denoted by $|y_1\rangle, \dots, |y_m\rangle$ or their respective indices k + 1, . . . , k + m) represent the labels.
Initially, all qubits are assumed to be prepared in the ground state $|0\rangle^{\otimes (k+m)}$. (41) In the next step, two unitary operators are applied. The first is responsible for the encoding of the training data and the second for the encoding of the structure (i. e., the decision configuration) of the tree. Finally, a projective measurement on the first d + 1 qubits and the last m qubits is performed. We discuss the components of the circuit in more detail below.

Data encoding
The first unitary operator $\hat{U}_{\mathrm{data}}(D_{\mathrm{train}})$ in the Q-tree, Fig. 3, is responsible for an encoding of the training data $D_{\mathrm{train}}$, Eq. (1), in the form of quantum amplitudes [30]. By definition, it acts on all k + m qubits and leads to the state $|\Psi(D_{\mathrm{train}})\rangle = \sum_{x, y} \sqrt{p(x, y)}\, |x, y\rangle$, (42) where we recall p(x, y), Eq. (20). The encoded state $|\Psi(D_{\mathrm{train}})\rangle$ contains all k + m (possibly entangled) qubits. This data encoding, where features are encoded in qubits and their respective probabilities in the state amplitudes such that the amplitude vector represents a classical discrete probability distribution, is also referred to as qsample encoding [22,31,32]. Classically, storing the total probability distribution requires $O(2^{k+m})$ classical memory slots. Since only k + m qubits are required to store the complete information from p(x, y), the qsample encoding achieves an exponential compression. To obtain $\hat{U}_{\mathrm{data}}(D_{\mathrm{train}})$ for a given $D_{\mathrm{train}}$, classical algorithms can be used [33][34][35]. Recent methods include black-box state preparation without arithmetic [36,37] and quantum generative adversarial networks [38][39][40][41]. We do not further discuss these approaches here and instead assume that an appropriate data encoding operator is available to realize Eq. (42). In general, $\hat{U}_{\mathrm{data}}(D_{\mathrm{train}})$ consists of $O(2^{k+m})$ gate operations [42] but can be simplified if the probability distribution is not fully correlated [22]. The complexity can be further reduced if ancilla qubits are used [43][44][45], an approach which we do not further pursue in this manuscript.
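The target state of the qsample encoding can be illustrated classically. The following Python sketch computes the amplitude of every computational basis state as the square root of the corresponding probability. It only describes the state to be prepared and says nothing about how the encoding operator is compiled into gates. The function name and the representation of basis states as bit tuples (features followed by labels) are assumptions of this sketch.

```python
import math
from itertools import product


def qsample_amplitudes(p, k, m):
    """Amplitudes sqrt(p(x, y)) for all 2**(k + m) basis states.

    p maps a bit tuple of length k + m (features followed by labels) to its
    probability. Missing entries are treated as probability zero.
    """
    return {
        bits: math.sqrt(p.get(bits, 0.0))
        for bits in product((0, 1), repeat=k + m)
    }


if __name__ == "__main__":
    amplitudes = qsample_amplitudes({(0, 1): 0.25, (1, 1): 0.75}, k=1, m=1)
    # The squared amplitudes reproduce the probabilities and sum to one.
    print(sum(a * a for a in amplitudes.values()))
```

The dictionary holds $2^{k+m}$ entries, which mirrors the classical memory requirement mentioned in the text, whereas the quantum state itself occupies only k + m qubits.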

Structure encoding
It is well-known that quantum circuits with a qsample encoding can be used to manipulate probability distributions, for example to perform marginalizations [22,32]. This circumstance can be exploited to construct a unitary operator with a conditional mapping of the initial probability distribution of the data to marginalized conditional probability distributions in individual leaves of the tree. This is the conceptual idea behind the structure encoding (or splitting decision encoding), which we explain in more detail in the following.
The second unitary operator $\hat{U}_{\mathrm{struct}}(C)$ in the Q-tree, Fig. 3, is responsible for this structure encoding according to Eq. (43). It only acts on the first k qubits and is determined by the compressed decision configuration C, Eq. (12). Specifically, each layer swap $c^i$ with i ∈ {0, . . . , d}, Eq. (13), can be associated with $2^i$ (multi-controlled) SWAP gates, which are partially "decorated" by NOT gates. The NOT gates are added in such a way that each combination of decorated and non-decorated controlled qubits is placed once, each assigned to one element $c^i_j$ of $c^i$ with $j \in \{1, \dots, 2^i\}$. This sequence of gates can be expressed by ordered products of unitary operators, Eqs. (44) and (45), for each layer swap. The products in Eqs. (44) and (45) are ordered products in the sense of $\prod_{i=a}^{b} \hat{O}_i \equiv \hat{O}_b \hat{O}_{b-1} \cdots \hat{O}_{a+1} \hat{O}_a$ (46) for a set of unitary operators $\hat{O}_i$ for i ∈ {a, . . . , b} and a ≤ b. Furthermore, we introduce the unitary operator $\hat{\Gamma}^i_j(c)$, Eq. (47), with c ∈ {i + 1, . . . , k}. It contains abbreviations for the operators $\hat{\gamma}^i(c)$, Eq. (49), and $\hat{\mu}^i_j$, in which the placement of NOT gates is governed by κ(j, v, u), Eq. (51), with v ∈ {1, . . . , d}. The unitary operators $\hat{\mu}^i_j$ and $\hat{\gamma}^i(c)$ represent one of four standard gates [29]:
• An identity gate, denoted by 1, which has no effect on the state.
• A NOT or Pauli X gate acting on the qth qubit with q ∈ {1, . . . , k}, which is denoted by NOT(q).
• A SWAP gate with respect to two qubits $q_1$ and $q_2$ to swap (with $q_1, q_2 \in \{1, \dots, k\}$), which is denoted by SWAP($q_1$, $q_2$).
• A (multi-)controlled SWAP gate with respect to two target qubits $q_1$ and $q_2$ to swap (with $q_1, q_2 \in \{1, \dots, k\}$) and v ≥ 1 control qubits defined by the non-empty set $Q \in \{1, \dots, k\}^v$, which is denoted by $C^v\mathrm{SWAP}(q_1, q_2, Q)$ and acts on v + 2 qubits. $C^1\mathrm{SWAP}$ represents a CSWAP or Fredkin gate. Furthermore, for identical swapping qubits $q_1 = q_2$, one has $C^v\mathrm{SWAP}(q_1, q_1, Q) = 1$.
To simplify our notation, we also treat the uncontrolled SWAP gate as the special case v = 0 of the controlled SWAP gate. The operations in $\hat{U}_{\mathrm{struct}}(C)$ can consequently be considered as a sequence of these standard gates.
The NOT, SWAP, and (multi-)controlled SWAP gates are understood to act as identity gates (i. e., 1) on the remaining qubits. An example circuit is outlined in Fig. 4. It shows the connection between the structure of a decision tree T (which is parameterized by the decision configuration E, Eq. (6), and the compressed decision configuration C, Eq. (12), with the correspondence from Eq. (14)) and its Q-tree encoding via Eq. (44). The quantum representation of the tree structure of all nodes of depth d requires up to $d 2^d$ NOT gates and up to $2^d$ $C^d\mathrm{SWAP}$ gates. However, some of these NOT gates are consecutively applied to the same qubit and therefore eliminate each other.
To estimate the complexity of the structure encoding circuit, we decompose it into one- and two-qubit gate operations [42]. For this purpose, we make use of a relation [46] that expresses a $C^v\mathrm{SWAP}$ gate in terms of (multi-)controlled NOT gates for $q_1 \neq q_2$ with v ≥ 0 control qubits, which is also sketched in Fig. 5. Here, $C^v\mathrm{NOT}(q, Q)$ denotes a (multi-)controlled NOT gate acting on qubit q with v ≥ 1 control qubits defined by the non-empty set Q. Such a gate is also called (multi-controlled) Toffoli gate. The $C^v\mathrm{NOT}$ gate can be further decomposed in various ways [47], depending on whether additional ancilla qubits are to be used or not. We particularly consider an ancilla-free decomposition into 8v − 20 two-qubit controlled-rotation gates [48]. Thus, the structure encoding of a tree T with depth d requires a total of $O(d 2^d)$ NOT gates and O(d) two-qubit controlled-rotation gates, resulting in an effective complexity of $O(d 2^d)$ gates.
Figure 4: Example of a structure encoding circuit (beginning of the caption lost). The circuit is composed of the operators from Eq. (45), which in turn are sequences of NOT and $\hat{\gamma}^i(c)$ gates. The latter represent a SWAP gate or a $C^i\mathrm{SWAP}$ gate as defined in Eq. (49). We denote NOT gates with X and show the individual qubits affected by the SWAP and $C^i\mathrm{SWAP}$ gates: control qubits are marked by α and α′, respectively, whereas qubits to swap and their potential swapping partners are marked by β and β′, respectively. The parameters α, α′, and β are determined by d, whereas β′ is determined by the corresponding swap index $c^i_j$. Directly consecutive NOT gates eliminate each other and can therefore be omitted, but we nevertheless show them for the sake of completeness.
As visualized in Fig. 4, each unitary operator $\hat{\Gamma}^i_j(c)$, Eq. (47), can be associated with a node of depth i and index j in T. According to the definition of the operator, NOT gates appear only in pairs and therefore effectively only $C^i\mathrm{SWAP}$ gates are applied. In particular, the NOT decorations are exclusively attached to control qubits before and after a $C^i\mathrm{SWAP}$ and are therefore responsible for flipping the controls temporarily. The placement of NOT gates is determined by κ(j, v, u), Eq. (51), which decides whether the uth qubit acts as an open control gate (for κ(j, v, u) = 0) or a closed control gate [46]. The application of $\hat{\Gamma}^i_j(c)$ to an arbitrary entangled k-qubit state $\sum_x a(x) |x\rangle$ with amplitudes a(x) consequently yields a state $\sum_x a_\Gamma(x) |x\rangle$ with transformed amplitudes $a_\Gamma(x)$ for i ≥ 1 and, without loss of generality, i + 1 < c. Clearly, the only transformation of the amplitudes is a conditional swap of indices. In case of i = 0, the index swap is performed unconditioned and for i + 1 = c, one has $a_\Gamma(x) = a(x)$.
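The conditional swap of amplitude indices can be mimicked classically. The following Python sketch applies a controlled swap with open and closed controls to a dictionary of amplitudes indexed by bit tuples. The control pattern is passed explicitly instead of being derived from κ(j, v, u), and the function name `controlled_swap` is hypothetical.

```python
def controlled_swap(amplitudes, q1, q2, controls):
    """Swap the bits of qubits q1 and q2 if all controls are satisfied.

    amplitudes maps bit tuples to amplitudes. controls maps a 1-based qubit
    index to the bit value that activates the swap: 0 corresponds to an open
    control (NOT-decorated) and 1 to a closed control. The control qubits
    must differ from q1 and q2.
    """
    if q1 in controls or q2 in controls:
        raise ValueError("control qubits must differ from the swapped qubits")
    result = {}
    for bits, amplitude in amplitudes.items():
        if all(bits[q - 1] == value for q, value in controls.items()):
            swapped = list(bits)
            swapped[q1 - 1], swapped[q2 - 1] = swapped[q2 - 1], swapped[q1 - 1]
            bits = tuple(swapped)
        result[bits] = amplitude
    return result
```

Because the control bits are untouched by the swap, the map is a permutation of basis states: the amplitudes are only relabeled and the norm of the state is preserved. An empty control dictionary reproduces the unconditioned swap of the root layer.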

Measurement
The projective measurement at the end of the Q-tree yields a bit string, Eq. (55), with a measurement probability as shown in App. C, where we recall C_ν = C_ν(x_1, ..., x_{d+1}), Eq. (28), and p(C_ν, y = ȳ), Eq. (34). Thus, the measurement probabilities allow us to infer the probability distribution of reaching a certain leaf, Eq. (30), and the label probability distribution, Eq. (31), respectively. In addition, we can also obtain the probability distribution of the ith label with i ∈ {1, ..., m}, Eq. (32). In summary, we have shown that measuring the Q-tree (i. e., the circuit shown in Fig. 3 with k + m qubits and a complexity of O(2^(k+m) + d 2^d) gates) corresponds to drawing a sample from the conditional probability distribution p(C_ν, y = ȳ), Eq. (34), that is associated with the randomly selected leaf of depth d and index ν of the respective decision tree T. The probability that a certain leaf is selected is given by the corresponding probability distribution p(C_ν), Eq. (30). Consequently, the Q-tree can be considered a quantum representation of T.
Conversely, from multiple measurements we can estimate p(C_ν) and p(y_i | C_ν), respectively. For this purpose, an ensemble of bit strings {b_1, ..., b_N} is collected from a sequence of N measurements. For each bit string, the corresponding set of conditions C_ν, Eq. (28), is extracted from the first d + 1 bits and the corresponding labels ȳ from the next m bits. The number of measurements n(C_ν) that fulfill the set of conditions C_ν is collected and divided into 2m bins, that is, one bin per label value for each of the labels i ∈ {1, ..., m}. Based on these measurement results, we can obtain the approximations p̂(C_ν), Eq. (60), and p̂(y_i | C_ν), Eq. (61), respectively. For n(C_ν) = 0 (i. e., none of the measured bit strings fulfills the set of conditions C_ν), Eq. (61) is undefined. The (statistical) uncertainties of Eqs. (60) and (61) are discussed in App. D. The sampling from measurements corresponds to a truly random tree traversal in the sense that it relies on intrinsic quantum randomness [49]. If we assume that the training data D_train is sufficiently large such that Eq. (21) holds true, such samples are in fact approximately drawn from the underlying distribution, Eq. (5). In the following, we briefly explain how we can use the proposed setup to induce trees or to predict the labels of (uncertain) query data.
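The counting procedure above can be sketched in a few lines. This is an illustrative sketch only: it assumes that each measured bit string lists the d + 1 path bits first and the m label bits next, and that the label estimate is the relative frequency of the value 1 for each label. The function name `estimate_distributions` is hypothetical.

```python
from collections import Counter


def estimate_distributions(bitstrings, d, m):
    """Estimate leaf and label probabilities from measured bit strings.

    Returns (p_leaf, p_label): p_leaf[leaf] is the relative frequency of the
    leaf, and p_label[leaf][i] is the relative frequency of label i being 1
    among the samples that reached this leaf. Leaves without samples are
    absent, reflecting that their label estimate is undefined.
    """
    n_total = len(bitstrings)
    leaf_counts = Counter()
    one_counts = Counter()
    for b in bitstrings:
        leaf, labels = b[: d + 1], b[d + 1 : d + 1 + m]
        leaf_counts[leaf] += 1
        for i, bit in enumerate(labels):
            if bit == "1":
                one_counts[(leaf, i)] += 1
    p_leaf = {leaf: n / n_total for leaf, n in leaf_counts.items()}
    p_label = {
        leaf: [one_counts[(leaf, i)] / n for i in range(m)]
        for leaf, n in leaf_counts.items()
    }
    return p_leaf, p_label


samples = ["000"] * 3 + ["001"] + ["110"] * 4   # d = 1, m = 1
p_leaf, p_label = estimate_distributions(samples, d=1, m=1)
```

A caller has to decide how to treat leaves that never appear in the samples, since the label estimate is undefined for them.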

Tree induction
To induce a Q-tree in the sense of Eq. (15), we consider it as a parameterized (or variational) circuit [39]. Specifically, according to Eqs. (42) and (43), the Q-tree is parameterized by the training data D_train, Eq. (1), and the compressed decision configuration C, Eq. (12). Since the training data is supposed to be immutable, the variable parameters are represented only by C. Consequently, a variational quantum algorithm can be used to optimize the parameterized circuit according to some optimization goal [50]. Following this strategy, we employ a genetic algorithm in the same way as for the classical tree. For this purpose, we replace the fitness function, Eq. (18), by F̂(b_1, ..., b_N, D_train), Eq. (62), where {b_1, ..., b_N} represents an ensemble of N bit strings, Eq. (55), collected from a sequence of N measurements of the Q-tree. Due to the noisy nature of the measurement results from quantum circuits, F̂(b_1, ..., b_N, D_train) can in fact be considered a noisy fitness function [51]. A concrete example for Eq. (62) is proposed in App. B. The use of genetic algorithms in the context of quantum computing is well established [52][53][54]. Recent results particularly highlight the potential of evolutionary methods for parameterized circuit optimization in the presence of noisy quantum hardware [55][56][57]. However, there are still unresolved practical challenges such as barren plateaus [58].

Tree predictions
To predict the labels of query data x_q ∈ B^k, we assume that the estimated label probability distribution p̂(y_i | C_ν), Eq. (61), is known for all ν ∈ {1, ..., 2^d} from a previous sampling. In this case, there are two alternative prediction approaches, which we present in the following.
The first prediction approach corresponds to a classical evaluation of the tree in the sense that the knowledge about the decision configuration E, Eq. (6), allows a conditional traversal of the tree according to the elements of x_q until a leaf is reached. Thus, the predicted probability distribution for the ith label with i ∈ {1, ..., m} is given by Eq. (63) in analogy to Eq. (39), where we recall C_{q_d}(x_q), Eq. (38). Furthermore, we can use Eq. (64) in analogy to Eq. (40) to predict labels.

The second prediction approach corresponds to a semi-quantum evaluation of the tree. This approach allows us to process a probability distribution of query data with support B^k instead of a single data point, which enables the consideration of uncertainty in the query data. For this purpose, the evaluation of the quantum circuit outlined in Fig. 6 is required, which we refer to as the query circuit. It contains k qubits, which are assumed to be initially prepared in the ground state |0⟩, Eq. (41), and consists of two consecutive unitary operators followed by a projective measurement of the first d + 1 qubits. The first unitary operator Û_query(p_q) is responsible for a qsample encoding of p_q(x_q) in analogy to Û_data(D_train), Eq. (42), whereas the second unitary operator corresponds to Û_struct(C), Eq. (44), and is therefore responsible for the structural encoding of the tree T. In particular, p_q can also represent a single query data point x_q by setting p_q(x_q) = 1.
The qsample data encoding can in this case be achieved straightforwardly by applying a suitably chosen array of NOT gates. A measurement yields a bit string, Eq. (67). We obtain an ensemble {b_1, ..., b_N} of such bit strings from N measurements and count the number of measurements n(C_ν) that fulfill a set of conditions C_ν, Eq. (28), such that Σ_{C_ν} n(C_ν) = N. Thus, we can estimate the probability distribution of reaching a certain leaf, Eq. (69), in the sense of Eq. (30). The predicted probability distribution for the ith label with i ∈ {1, ..., m} is consequently given by Eq. (70), which represents an average of p̂(y_i | C_ν) over all leaves in the sense of an approximation of Eq. (71), where we recall Eq. (39). The labels can be obtained in analogy to Eq. (64). The (statistical) uncertainties of Eqs. (63), (69) and (70) are discussed in App. D.

Figure 6: Query circuit. Layout for the prediction of labels for a probability distribution of query data p_q(x_q), Eq. (65), using a Q-tree, Fig. 3. The circuit contains k qubits and consists of the unitary operator Û_query(p_q) for the qsample encoding of p_q(x_q), the unitary operator Û_struct(C), Eq. (44), for the structural encoding of the tree T, and a subsequent projective measurement of the first d + 1 qubits. The measurement result, Eq. (67), can be used for the prediction of labels of uncertain query data, Eq. (70).

The disadvantage of these two prediction methods is that they presume that the whole estimated label probability distribution p̂(y_i | C_ν), Eq. (61), is available beforehand, which requires O(2^d) classical memory slots. As an alternative, we propose an on-demand sampling prediction method to reduce this requirement to O(1) classical memory slots. This method consists of four steps, which are sketched in Fig. 7.
First, a probability distribution of query data p_q(x_q), Eq. (65), is prescribed, which can in particular also represent a single query data point. Second, the corresponding query circuit, Fig. 6, is measured N times to sample the probability distribution of reaching a certain leaf p̂_q(C_ν), Eq. (69). All node indices of non-zero elements of this probability distribution are collected in the set Λ(p_q), where ν iterates over all leaves. For a single query data point, one has |Λ(p_q)| = 1 such that the set of corresponding non-zero elements requires O(1) classical memory slots (instead of O(2^d) when also storing the vanishing elements). Third, the Q-tree of interest, Fig. 3, is measured N times to sample the predicted probability distribution for the ith label p̂(y_i | C_ν), Eq. (63), in such a way that only leaves present in Λ(p_q) are considered. That is, for a single query data point, only O(1) classical memory slots are required to store the resulting set of entries. Fourth, these entries are used according to Eqs. (73) and (74) to perform the prediction for the prescribed distribution of query data. In total, O(1) classical memory slots are required for a single query data point. Therefore, an exponential data compression can be achieved for the prediction similar to the exponential data compression for the data encoding from Sec. 4.1.1. The disadvantage of this approach is, of course, that the sampling must be performed anew for each query based on the evaluation of the query circuit. Thus, the reduction of classical memory is traded off for additional quantum computations. For a sufficient number of measurements, the on-demand sampling leads to the same predictions as the presampled approach. On noisy quantum hardware, however, the predictions might deviate. For example, one might collect measurements from the query circuit such that |Λ(p_q)| > 1 despite considering only a single query data point. In this case, choosing the leaf with the largest probability could serve as a suitable method for noise mitigation.
A more detailed discussion of this topic goes beyond the scope of this paper, but can be considered as a possible research direction.
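The four steps of the on-demand sampling method can be sketched classically as follows. This is a hedged illustration under stated assumptions, not the authors' implementation: the two samplers `sample_query` and `sample_qtree` are hypothetical stand-ins for measuring the query circuit and the Q-tree, each yielding bit strings one at a time, and the final combination is assumed to be the leaf-weighted average of the label estimates.

```python
from collections import Counter


def on_demand_predict(sample_query, sample_qtree, n_shots, d, m):
    """Predict label probabilities while storing only the reached leaves."""
    # Step 2: sample the query circuit and keep the non-zero leaves (Lambda).
    reached = Counter(b[: d + 1] for b in sample_query(n_shots))
    p_query = {leaf: n / n_shots for leaf, n in reached.items()}

    # Step 3: sample the Q-tree, discarding every leaf outside Lambda.
    hits = Counter()
    ones = Counter()
    for b in sample_qtree(n_shots):
        leaf = b[: d + 1]
        if leaf in p_query:
            hits[leaf] += 1
            for i in range(m):
                ones[(leaf, i)] += int(b[d + 1 + i] == "1")

    # Step 4 (assumed form): average the label estimates over reached leaves.
    prediction = [0.0] * m
    for leaf, weight in p_query.items():
        if hits[leaf] == 0:
            continue  # estimate undefined for this leaf; skipped in this sketch
        for i in range(m):
            prediction[i] += weight * ones[(leaf, i)] / hits[leaf]
    return prediction


# Step 1: a single query point that always reaches leaf "01" (d = 1, m = 1).
def sample_query(n):
    return ("01" for _ in range(n))


def sample_qtree(n):
    return iter(["011", "010", "011", "001"][:n])


prediction = on_demand_predict(sample_query, sample_qtree, n_shots=4, d=1, m=1)
```

For a single query point, the dictionaries hold one leaf only, so the classical storage does not grow with the tree depth. On noisy hardware, one could additionally restrict `p_query` to its most probable leaf before step 3.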

Experiments
In the present section, we test the proposed Q-tree approach for quantum representations of decision trees in practice. As our study example, we consider two data sets. First, the tic-tac-toe endgame data set [59] from the UCI machine learning repository [60], which we use for experiments on a quantum computing simulator (on classical hardware), and second, a simple toy data set, which we use to run experiments on actual quantum hardware. In the following, we start by describing the data. Subsequently, we present the experimental results on the simulator and the hardware, respectively.

Data
We consider two data sets for our experiments. The first data set is the tic-tac-toe endgame data set [59], which contains the complete set of possible board configurations at the end of tic-tac-toe games. Each of the 958 instances corresponds to one legal tic-tac-toe endgame board. A board is encoded by a nine-dimensional vector of ternary attributes in {0, 1, 2}^9, which determine whether each of the nine board fields is taken by one of the two players or left blank (0 represents the starting player, 1 represents the other player and 2 represents a blank field) with the allocation shown in Fig. 8. We convert this vector into a 15-dimensional binary feature vector x ∈ B^15 (i. e., k = 15). In addition, a binary label y ∈ B^1 (i. e., m = 1) is assigned to each board instance to indicate the player who won the game (y_1 = 1 represents a victory of the starting player and y_1 = 0 a victory of the other player), where a victory is achieved by a "three-in-a-row."

The second data set we consider is a toy data set, which consists of five seven-dimensional binary feature vectors (i. e., k = 7) and five corresponding binary labels (i. e., m = 1). In total, the data set is given by Eq. (77), where each row represents a data point and the columns contain the feature vector and the label, respectively. By definition, y_1 = 1 if x contains an odd number of ones and y_1 = 0 otherwise (i. e., the label represents an exclusive or operation over all features). In particular, the label depends only on the first three features, whereas the last four features carry no information.

Simulator
We use the quantum computing simulator provided by Qiskit (Quantum Information Software Kit) [61] to evaluate quantum circuits on classical hardware. In particular, we empirically test Q-trees using the data-based decision problem to predict the winner of a tic-tac-toe match y ∈ B^1 based on the board configuration x ∈ B^15 as described in Sec. 5.1. For this purpose, the tic-tac-toe endgame data set is randomly split into a training data set D_train, Eq. (1), and a test data set D_test of equal size (i. e., 479 instances each) such that both subsets contain the same proportion of labels. We use the training data set to induce decision trees and the test data set to verify the respective tree performances, as explained in more detail in the following.
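A label-preserving (stratified) split of this kind can be sketched with the standard library alone. The function name `stratified_split`, the seed, and the class counts in the usage example are illustrative assumptions; the counts are merely chosen to sum to 958 and are not taken from the text.

```python
import random


def stratified_split(labels, fraction=0.5, seed=0):
    """Split indices into two parts with matching label proportions."""
    rng = random.Random(seed)
    by_label = {}
    for idx, y in enumerate(labels):
        by_label.setdefault(y, []).append(idx)
    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                    # random assignment per class
        cut = round(fraction * len(indices))    # same fraction for every class
        train += indices[:cut]
        test += indices[cut:]
    return sorted(train), sorted(test)


labels = [1] * 626 + [0] * 332                  # illustrative class counts
train_idx, test_idx = stratified_split(labels)
```

Splitting each class separately is what keeps the label proportions identical in both halves, up to rounding for odd class sizes.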
To begin with, we induce a decision tree of maximum depth d = 3 as described in Sec. 4.2 using the genetic algorithm from App. B with the hyperparameters θ = (16, 20, 3, 0.3, 0.5, 0.15, 1, 1), Eq. (101), and N = 10^6 measurements. We repeat the induction 25 times with different random seeds. As a result, we obtain 25 Q-trees QT^1_sim, ..., QT^25_sim (each parameterized by a corresponding compressed decision configuration, Eq. (12)). The mean distribution of splitting feature indices in each depth is shown in Fig. 9a. We find that the splitting index 15 occurs most often for the root, whereas the most frequent splitting indices over all depths are 1 and 15.
A final sampling with N = 10^6 measurements is performed for each Q-tree to obtain an estimate for the label probability distribution p̂(y_i | C_ν), Eq. (61), for all leaves ν, Eq. (8), with the corresponding set of conditions C_ν, Eq. (28). Based on this estimate, predictions of query data can be performed using Eq. (63). If we find zero samples for a particular leaf (i. e., n(C_ν) = 0), the probability distribution p̂(y_1 | C_ν), Eq. (61), is undefined. To still be able to make predictions, we assign the agnostic default value p̂(y_1 | C_ν) = 1/2. Furthermore, to resolve the resulting ambiguity of ŷ_1, Eq. (64), we instead predict the majority label present in the training data set.
For comparison purposes, we employ scikit-learn [62] to induce a classical decision tree T_cl of the same maximum depth using a top-down approach with information entropy as a splitting criterion. The resulting tree is shown in Fig. 10b. The corresponding distribution of splitting feature indices in each depth is visualized in Fig. 9b. Clearly, there is a close resemblance between the distributions in Fig. 9a and Fig. 9b, but there are also certain differences. For example, the splitting index for the root is 15 in Fig. 9b, which corresponds to the majority of splitting indices in Fig. 9a. On the other hand, the most frequent splitting indices over all depths in Fig. 9b are 1, 4, and 8 as opposed to 1 and 15 in Fig. 9a. Here and in the following, we use the definition imposed by scikit-learn that balanced label probabilities p(y_1 = 0 | C_ν) = p(y_1 = 1 | C_ν) = 1/2, Eq. (31), lead to a prediction y_1 = 0 for T_cl.
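As a primary performance metric for the decision trees, we consider the balanced accuracy
$$\mathrm{bac} \equiv \frac{1}{2}\left(\frac{\mathrm{tp}_0}{\mathrm{tp}_0+\mathrm{fn}_0}+\frac{\mathrm{tn}_0}{\mathrm{tn}_0+\mathrm{fp}_0}\right) = \frac{1}{2}\left(\frac{\mathrm{tp}_1}{\mathrm{tp}_1+\mathrm{fn}_1}+\frac{\mathrm{tn}_1}{\mathrm{tn}_1+\mathrm{fp}_1}\right), \qquad (78)$$
which is based on the number of true positives tp_b, Eq. (79), the number of true negatives tn_b, Eq. (80), the number of false positives fp_b, Eq. (81), and the number of false negatives fn_b, Eq. (82), respectively, where b ∈ B denotes the label treated as positive and D the data set of interest.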
Here, ŷ_1(x_q) ∈ B represents a prediction of the label for the query data x_q ∈ B^15. For the Q-tree, we use Eq. (64) to perform predictions. To obtain the "best" decision tree out of our induced sequence of Q-trees, we choose the tree with the highest balanced accuracy with respect to the training data set D_train. The resulting tree QT_sim is associated with the decision configuration C_sim = ((15), (10, 2), (6, 14, 15, 8), (12, 15, 15, 9, 9, 6, 9, 4)), Eq. (83). Exemplary query predictions for this tree are demonstrated in Fig. 11. Two correct and two incorrect predictions are used for this purpose. Since the predictions are performed on the basis of previously sampled probability distributions, no additional quantum computations are necessary. In addition, we also demonstrate two uncertain query predictions using p̂_qm(y_i | p_q), Eq. (70), in Fig. 12. They require additional quantum computations for the estimation of p̂_q(C_ν | p_q), Eq. (69). The predicted probability distribution from the sampling, Eq. (70), coincides with the true probability distribution calculated from the data proportions, Eq. (71). However, since the two uncertain queries represent a mixture of boards with opposite labels, we cannot assess whether or not the resulting predictions are correct in the sense of a correct classification.

Figure 11 (caption, partially recovered): [...] Eq. (63). The corresponding bit string x̄_0 x̄_1 x̄_2 x̄_3 represents the traversed path in the tree, Eq. (38). We indicate for each example whether the prediction is correct or incorrect using Eq. (40).

Next, we perform a direct comparison of the classification performance between QT^1_sim, ..., QT^25_sim and T_cl. In addition to the balanced accuracy, Eq. (78), we also consider the unbalanced accuracy
$$\mathrm{acc} \equiv \frac{\mathrm{tp}_0+\mathrm{tn}_0}{\mathrm{tp}_0+\mathrm{tn}_0+\mathrm{fp}_0+\mathrm{fn}_0} = \frac{\mathrm{tp}_1+\mathrm{tn}_1}{\mathrm{tp}_1+\mathrm{tn}_1+\mathrm{fp}_1+\mathrm{fn}_1}, \qquad (84)$$
the precision (fraction of relevant instances among the retrieved instances)
$$\mathrm{pre}_b \equiv \frac{\mathrm{tp}_b}{\mathrm{tp}_b+\mathrm{fp}_b}, \qquad (85)$$
the recall (fraction of relevant instances that are retrieved)
$$\mathrm{rec}_b \equiv \frac{\mathrm{tp}_b}{\mathrm{tp}_b+\mathrm{fn}_b}, \qquad (86)$$
and the balanced F-score (harmonic mean of precision and recall)
$$\mathrm{f1}_b \equiv \frac{2\,\mathrm{pre}_b\,\mathrm{rec}_b}{\mathrm{pre}_b+\mathrm{rec}_b}, \qquad (87)$$
respectively, with b ∈ B, where we recall Eqs. (79) to (82).
The performance metrics are determined for the training and test data set (i. e., D ∈ {D train , D test }).
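The metrics above follow the standard confusion-matrix definitions and can be computed with a short helper. This sketch assumes binary labels and treats the label b as the positive class; the function names and the choice to return NaN for a division by zero are illustrative.

```python
def confusion(y_true, y_pred, b):
    """Count tp, tn, fp, fn when label b is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == b and p == b for t, p in pairs)
    tn = sum(t != b and p != b for t, p in pairs)
    fp = sum(t != b and p == b for t, p in pairs)
    fn = sum(t == b and p != b for t, p in pairs)
    return tp, tn, fp, fn


def metrics(y_true, y_pred, b):
    """Return acc, bac, pre, rec and f1 for the positive label b."""
    tp, tn, fp, fn = confusion(y_true, y_pred, b)

    def ratio(num, den):
        # A vanishing denominator makes the metric undefined.
        return num / den if den else float("nan")

    pre = ratio(tp, tp + fp)
    rec = ratio(tp, tp + fn)
    return {
        "acc": ratio(tp + tn, tp + tn + fp + fn),
        "bac": 0.5 * (rec + ratio(tn, tn + fp)),
        "pre": pre,
        "rec": rec,
        "f1": ratio(2 * pre * rec, pre + rec),
    }


y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
m1 = metrics(y_true, y_pred, b=1)
m0 = metrics(y_true, y_pred, b=0)
```

The accuracy and the balanced accuracy do not depend on which label is treated as positive, whereas precision, recall and F-score generally do.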
The results are listed in Tab. 1. We find that for the training data, certain best metrics of the Q-trees (acc, pre 0 , rec 1 , and f1 1 ) are superior to the corresponding metrics of the classical tree T cl , whereas others (bac, pre 1 , rec 0 , and f1 0 ) are inferior. For the test data, the best metrics of the Q-trees are always superior or at least as good as the corresponding metrics of the classical tree. In all cases, the mean Q-tree metrics and the metrics of the exemplary Q-tree QT sim , Eq. (83), are worse than or equal to the corresponding metrics of the classical tree.
To quantify the performance of the tree induction, we consider a set of randomly generated decision trees. For this purpose, we uniformly draw 25 decision configurations, Eq. (12). Subsequently, we perform a sampling with N = 10^6 measurements for each corresponding Q-tree to obtain an estimate for the label probability distribution. We determine the balanced training and test accuracy, Eq. (78), for each tree and obtain a mean and standard deviation (in brackets) of 0.53(4) for both test and training data. This result is significantly worse than the mean balanced accuracy over all induced trees listed in Tab. 1. We consequently conclude that our induction method is clearly superior to a random selection of trees for this particular example.
Finally, we compare the effect of using different numbers of experiments N ∈ {10^3, 10^4, 10^5, 10^6} to estimate the probability distribution of the decision set p̂(C_ν), Eq. (60), and the corresponding label probability distribution p̂(y_i | C_ν), Eq. (61), for the Q-tree QT_sim. We compare the estimates with the exact probability distributions based on data proportions p(C_ν), Eq. (30), and p(y | C_ν), Eq. (31), respectively, and also consider the statistical uncertainties as discussed in App. D.
The results are shown in Fig. 13. As expected, the sampled probabilities converge to the exact results for a sufficiently high number of experiments, whereas their standard deviations decrease with an increasing number of experiments. Moreover, the standard deviations of vanishing probabilities also vanish.
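The qualitative behavior described here can be reproduced with a purely classical sampling sketch. It assumes that each measurement is an independent Bernoulli trial and uses the binomial standard error sqrt(p̂(1 − p̂)/N) as the uncertainty; this is an assumption for illustration and not necessarily the expression derived in App. D. The function name and the seed are arbitrary.

```python
import math
import random


def sample_estimate(p, n_shots, rng):
    """Estimate a probability from n_shots Bernoulli trials."""
    hits = sum(rng.random() < p for _ in range(n_shots))
    p_hat = hits / n_shots
    # Binomial standard error; it vanishes for p_hat = 0 or p_hat = 1.
    sigma = math.sqrt(p_hat * (1 - p_hat) / n_shots)
    return p_hat, sigma


rng = random.Random(1)
estimates = {n: sample_estimate(0.3, n, rng) for n in (10**3, 10**4, 10**5)}
vanishing = sample_estimate(0.0, 10**3, rng)
```

In this model the uncertainty shrinks proportionally to 1/sqrt(N), and an outcome that never occurs has both a vanishing estimate and a vanishing standard error.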

Quantum hardware
A major challenge of quantum algorithms is to run them on noisy intermediate-scale quantum (NISQ) devices [63]. To perform quantum hardware experiments, we remotely access a physical quantum computer via Qiskit using the cloud-based quantum computing service provided by IBM Quantum [64]. This service allows online requests for quantum experiments using a high-level quantum circuit model of computation [65]. The quantum experiments are then realized and executed sequentially on physical hardware that operates on superconducting transmon qubits. Specifically, we access the quantum computer ibmq_ehningen (version 2.2.1), which represents a so-called Falcon r4 processor with 27 qubits and a quantum volume [66] of 32.

Figure 13 (caption, partially recovered): Label probabilities from Q-tree samples, Eq. (61), with standard deviation (one sigma) σ_p̂(k, m, N), Eq. (128), and from exact calculations.
A quantum experiment is in this context defined as a circuit to be evaluated on the quantum computer, whereas the returned result consists of a histogram of counts for all measured qubits. The quantum computer we use is only capable of executing a specific set of gates (NOT, R_z, √NOT, and CNOT) and has a limited connectivity of qubits (i. e., CNOT gates can only be applied to specific pairs of qubits) as sketched in Fig. 14. In a process called transpilation, Qiskit can algorithmically translate a general circuit into one that can be run on specific hardware. We use this functionality to realize our quantum experiments for the Q-tree. Furthermore, we make use of Qiskit's built-in state preparation approach [34] to find the data encoding operator Û_data(D_train), Eq. (42).
In total, a Q-tree of maximum depth d consists of k + m qubits and exhibits a complexity of O(2^(k+m) + d 2^d) gates. It is possible to reduce this complexity by only encoding data in those qubits that are actually measured. This effectively corresponds to performing the marginalization of p(x, y), Eq. (21), beforehand and leads to a circuit consisting of d + 1 + m qubits and O(2^d (d + 2^m)) gates. In addition, the first SWAP gate can be omitted by a suitable re-ordering of the data (i. e., by assigning the corresponding feature of the root decision to the first qubit). The data preparation consequently depends on the tree structure in these cases. To simplify our circuits, we make use of these complexity reductions in the following.
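To get a feeling for the size of this reduction, the two scaling expressions can be evaluated for the data sets used in the experiments. The numbers below are the arguments of the O(·) expressions only, with all constant factors ignored, so they indicate orders of magnitude rather than actual gate counts; the function name is hypothetical.

```python
def gate_scaling(k, m, d):
    """Evaluate the scaling expressions for the full and the reduced Q-tree."""
    full = {"qubits": k + m, "gates": 2 ** (k + m) + d * 2 ** d}
    reduced = {"qubits": d + 1 + m, "gates": 2 ** d * (d + 2 ** m)}
    return full, reduced


# Tic-tac-toe setting (k = 15, m = 1, d = 3) and toy setting (k = 7, m = 1, d = 1).
tictactoe_full, tictactoe_reduced = gate_scaling(15, 1, 3)
toy_full, toy_reduced = gate_scaling(7, 1, 1)
```

For the tic-tac-toe setting, the expressions evaluate to 65560 versus 40, and the number of qubits drops from 16 to 5, which illustrates why the reduction matters for NISQ devices.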
The tic-tac-toe endgame data set considered in Sec. 5.2 for the simulator turns out to be too complex for the actual quantum hardware. Therefore, we instead employ the simpler toy data set, Eq. (77), where we use the total data set for both training and testing purposes. In analogy to our simulator experiments, we start by inducing a decision tree of maximum depth d = 1 as described in Sec. 4.2 using the genetic algorithm from App. B. We use the hyperparameters θ = (31, 6, 4, 0.5, 0.5, 0.5, 1, 1), Eq. (101), and N = 8192 measurements (which is the maximum number of measurements, so-called shots, that can be performed on the chosen hardware in a single run). We repeat the induction 10 times with different random seeds and consequently obtain 10 Q-trees QT^1_qc, ..., QT^10_qc. Again, we perform a final sampling with N = 8192 measurements to obtain an estimate for the label probability distribution p̂(y_i | C_ν), Eq. (61). In addition, we use scikit-learn to induce a classical decision tree T_cl of the same maximum depth for comparison purposes, which is shown in Fig. 15b.

Figure 15 (caption, partially recovered): [...] Eq. (77), where all quantum computations have been performed on actual quantum hardware. We use the same notation as in Fig. 10.
We visualize the mean distribution of splitting feature indices in each depth in Fig. 16 and find a close resemblance between the distributions in Fig. 16a and Fig. 16b. Furthermore, the first three splitting indices, which refer to informative features, are chosen over the remaining splitting indices, which refer to uninformative features (in all except for one case in Fig. 16a). The most frequent splitting index over all depths is 2 in Fig. 16a, whereas in Fig. 16b the splitting indices 2 and 3 each occur once in the tree.
We choose the Q-tree with the highest balanced accuracy, Eq. (78), for demonstration purposes. This "best" Q-tree QT_qc is associated with the decision configuration given in Eq. (88) and shown in Fig. 15a. In particular, the structure of this Q-tree is equivalent to the structure of the corresponding classical decision tree T_cl except for the additional split at depth d = 1. We compare the previously introduced classification performance metrics of the Q-trees QT^1_qc, ..., QT^10_qc and the classical decision tree T_cl.
The results are shown in Tab. 2. We find that all best metrics of the Q-trees are superior to the corresponding metrics of the classical tree T_cl. Furthermore, the majority of the metrics of the exemplary Q-tree QT_qc (bac, acc, pre_0, rec_1, and f1_1) are better than or equal to the corresponding metrics of the classical tree, whereas the others (pre_1, rec_0, and f1_0) are worse. Due to their similar structure, the only difference between the predictions of QT_qc and T_cl comes from the path (or bit string) x̄_0 x̄_1 = 11. For the corresponding leaf, the Q-tree predicts y_1 = 1, whereas the classical tree estimates balanced label probabilities that lead to a prediction y_1 = 0 (according to the scikit-learn default behavior). All advantages of QT_qc are therefore a consequence of the noise from the quantum sampling process. Given these facts, the apparent superiority that can be inferred from Tab. 2 is questionable, and there is also no fundamental reason why our quantum representation should lead to better predictions than the classical representation. Given the simplicity of the problem, this is no surprise. However, our results at least indicate that the quantum representation running on a NISQ device is (for this particular example) not significantly inferior to the classical representation.

Table 2 (caption, partially recovered): [...] Eqs. (78) and (84) to (87), of the Q-trees QT^1_qc, ..., QT^10_qc, the exemplary Q-tree QT_qc, Eq. (88), and the classical decision tree T_cl, respectively, in analogy to Tab. 1. When calculating the mean and standard deviation, we ignore all metrics that are not well-defined due to a division by zero. The best results (with respect to two digits) are highlighted in bold. Columns: bac, acc, pre_0, pre_1, rec_0, rec_1, f1_0, f1_1. Recovered entries for the row "Mean Q-tree metric": bac 0.71(11), acc 0.76(8) [...].
To verify that our tree induction approach is superior to a random guess, we create a set of 10 randomly generated decision trees by uniformly drawing decision configurations, Eq. (12), and subsequently sample with N = 8192 measurements to obtain an estimate for the label probability distribution. The resulting mean and standard deviation (in brackets) of the balanced accuracy, Eq. (78), over all trees is 0.51(2) and is therefore significantly worse than the mean balanced accuracy over all induced trees listed in Tab. 2.
Next, we consider the estimations for the probability distributions for the best Q-tree QT qc in analogy to our experiments on the simulator. The results are shown in Fig. 17. We find that some of the sampled probabilities deviate significantly from the corresponding exact probabilities. We suppose that these deviations result from the hardware-induced noise.
To further quantify our observation, we take a closer look at the deviations. For this purpose, we assume that statistical errors of the estimated probability distributions are negligible due to a sufficiently large sampling size in comparison with the hardware-induced noise as discussed in App. D. Consequently, we can make use of the approximations in Eqs. (143) and (144), respectively. We furthermore presume that the hardware-induced noise, Eqs. (139) and (140), can be described by a truncated normal distribution N_0^1(x; µ, σ) within the interval (0, 1) with mean µ and standard deviation σ, where x ∈ (0, 1). Consequently, we arrive at the effective probability distributions given in Eqs. (89) and (90), respectively. Here we introduce the expectation value shifts µ and µ′, which represent hardware-induced systematic errors, and the standard deviations σ and σ′, which represent hardware-induced statistical uncertainties. These parameters can be obtained by fitting Eqs. (89) and (90) to the estimated probability distributions from the measurements. The fitted parameters are listed in Tab. 3. The resulting effective probability distributions are also shown in Fig. 17. As expected, we find both non-vanishing statistical and systematic errors for all bit strings. The systematic errors are neither strictly positive nor strictly negative. The largest overall statistical error σ = 0.0117 occurs for the bit string x̄_0 x̄_1 = 11, whereas the largest absolute overall systematic error |µ′| = 0.1617 occurs for the bit string x̄_0 x̄_1 = 10. However, a discussion of possible physical origins of these hardware-induced errors goes beyond the scope of this manuscript.

Figure 17 (caption, partially recovered): Sampled probability distributions of the Q-tree QT_qc, which is also shown in Fig. 15a, using N = 8192 measurements. We use a similar notation as in Fig. 13. All quantum computations have been performed on actual quantum hardware. For comparison, we also show the exact probability distributions based on the data proportions as well as estimations for the hardware-induced noise ("fit") based on the effective probability distributions from Eqs. (89) and (90) with the fitted parameters from Tab. 3. For the effective probability distributions, a separate axis is provided on top of the plot for each bit string x̄_0 x̄_1. Panel (b): [...] Tab. 3 and is plotted with respect to one of the four separate axes shown on top. An inset plot shows the results for x̄_0 x̄_1 = 00, x̄_0 x̄_1 = 01, and x̄_0 x̄_1 = 10 in more detail. We use the same symbols as in (a).

Table 3 (caption, partially recovered): [...] Eq. (90), respectively. The expectation value shifts µ and µ′ represent hardware-induced systematic errors, whereas the standard deviations σ and σ′ represent hardware-induced statistical uncertainties. The resulting effective probability distributions are also shown in Fig. 17.
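For reference, the density of a normal distribution truncated to the interval (0, 1) can be written down with the standard library alone. This sketch shows only the standard truncated normal density; how it enters the effective probability distributions of the text is not reproduced here, and the parameter values in the usage example are arbitrary illustrations, not the fitted values from Tab. 3.

```python
import math


def truncated_normal_pdf(x, mu, sigma, lower=0.0, upper=1.0):
    """Density of a normal distribution restricted to (lower, upper)."""
    if not lower < x < upper:
        return 0.0
    # Untruncated normal density at x.
    phi = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

    def cdf(t):
        return 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2))))

    # Renormalize by the probability mass inside the interval.
    return phi / (cdf(upper) - cdf(lower))


# Numerical check of the normalization with a midpoint rule.
steps = 10_000
total = sum(truncated_normal_pdf((i + 0.5) / steps, 0.2, 0.1) for i in range(steps)) / steps
```

If the mean lies far outside the interval relative to sigma, the normalization factor can underflow to zero, so a fitting routine built on this density would need to guard against that case.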

Conclusions
We have proposed a quantum representation of classification trees in the form of a quantum circuit, called Q-tree, which allows us to perform probabilistic tree traversals in the sense that it can sample from the corresponding probability distribution of the tree leaves. The proposed circuit consists of k + m qubits and O(2^(k+m) + d 2^d) gates (where d represents the tree depth, k the number of binary features and m the number of binary labels). Furthermore, we have also discussed how the sampled data can be used to feed a fitness function of a genetic algorithm to induce Q-trees in the form of parameterized circuits. In addition, we have provided three approaches for using Q-trees to predict the labels of query data. The first approach gathers samples and estimates the corresponding leaf probability distributions. The second approach modifies the circuit such that it directly draws samples from the distribution of leaves that a query distribution reaches, which allows us to predict the labels for uncertain queries. As a third method, we have proposed an on-demand sampling approach to perform Q-tree predictions with a constant number of classical memory slots, independent of the tree depth. Using simple data sets, we have studied the induction and prediction capabilities of Q-trees in comparison with a classical benchmark using both a simulator and actual IBM quantum hardware. To our knowledge, this is the first realization of a decision tree classifier on a quantum device. Based on our experiments, we have found that the performance of Q-trees is comparable to the considered classical benchmark.
However, we have also found that the field of application of Q-trees is very limited on NISQ devices and hardware-induced noise can have a major impact on the quality of the results. Investigation of the hardware and data requirements necessary for the application of Q-trees could be the subject of future work. In addition, explicitly modifying the Q-trees to better fit the demands of the hardware could be one way to further improve the results and broaden the field of application. For example, we have mentioned how the circuit complexity can be reduced by introducing ancilla qubits (for both the data encoding and the structure encoding). An application of these approaches could lead to shallower Q-trees, which might be more suitable for NISQ devices. Furthermore, our proposed on-demand sampling approach can also be modified to mitigate hardware-induced noise by employing the approach presented in Sec. 4.3. Further research is necessary to evaluate this topic in more detail.
There are various possibilities to extend or alter our proposed approach for different applications. For example, it can be generalized by considering decision trees with missing splits by an appropriate modification of the compressed decision configuration. Another obvious extension is the consideration of non-binary features and labels, respectively, which requires a suitable encoding with possibly more than one qubit per feature and label. Our approach could also be modified to allow decision trees for regression problems by employing a suitable probabilistic representation of the problem. All of these conceptual ideas can be considered as possible research directions for further work. Moreover, instead of single decision trees, another extension of our approach is the use of ensembles of Q-trees in the sense of random forests [67]. In App. E, we briefly discuss a possible strategy for such an extension.
In this work, we have only discussed heuristic tree induction methods. We have specifically proposed a genetic algorithm in App. B, for which we have also suggested possible variations. However, quantum computers can in principle also be used to induce globally optimal decision trees [5]. For this purpose, the corresponding mixed-integer optimization problem can, for example, be converted [68] into an unconstrained binary quadratic programming problem [69], which can then be solved with an appropriate quantum algorithm [70]. We consider such or a similar approach as another promising research direction for further studies.

and require that n exists exactly once in I. Moreover, we make use of the vector transformation with ω ∈ {0, 1} and u ∈ {l − 1, l}. It contains the composition of I with l, n ∈ {1, . . . , k}. We also recall Eq. (8) and introduce with 0 ≤ v ≤ l to identify the nodes along the subpath (with depth v < l) of a node of depth l, which is identified by a bit stringx 0 · · ·x l−1 . In particular, one has ν l (x 0 · · ·x l−1 ) = ν l (x 0 · · ·x l−1 ). Conversely, the inverse transformation rule C → E for any depth l > 0 reads such that the feature indices e i j are defined by the corresponding elements of Eq. (95). These elements represent a permutation of the elements of the index vector J, Eq. (91). We briefly illustrate the transformations of J in Fig. 18.
Summarized, a decision tree T can be parameterized by either E, Eqs.

B Genetic induction algorithm
In the present appendix section, we describe a concrete realization of a genetic algorithm for the induction of a decision tree T, Eq. (15), as introduced in Sec. 3.2. For this purpose, we assume that a constant maximum depth d is given for the tree to be induced. Furthermore, one can either choose to use the classical or the quantum representation of the tree T as described in Secs. 3 and 4, respectively.

Algorithm 1 Genetic tree induction algorithm
 1: function GeneticGrow(θ)
 2:     χ^best ← Initalize(1)
 3:     if P > 1 then
 4:         χ ← Initalize(P)
 5:         g ← 0
 6:         while g < G do
 7:             χ^best_1 ← Best(χ, f_ent, f_bac)
 8:     end if
23:     return Best(χ, f_ent, f_bac)
24: end function
The proposed genetic algorithm follows the typically used "traditional" structure and makes use of well-known genetic operators [21]. It is outlined in Alg. 1 and depends on the hyperparameters $\theta \equiv (P, G, t, p_c, p_m, p_a, f_{\mathrm{ent}}, f_{\mathrm{bac}})$, consisting of the total population size $P \in \mathbb{N}$, the number of iterations (or generations) $G \in \mathbb{N}$, the selection subset size $t \in \mathbb{N}$, the probability of mating two individuals $p_c \in [0, 1]$, the probability of mutating an individual $p_m \in [0, 1]$, the probability for each attribute to be mutated independently $p_a \in [0, 1]$, and the fitness function weights $f_{\mathrm{ent}}, f_{\mathrm{bac}} \in \mathbb{R}_0^+$. We denote a population of $P$ chromosomes (i. e., candidate solutions) of the form defined in Eq. (16) by the $P$-tuple $\chi \in S^P$, where we recall the solution domain $S$, Eq. (17). The chosen maximum depth $d$ consequently defines the dimensionality of the chromosomes. Furthermore, we denote the $i$th chromosome by $\chi_i$.

A central component of the algorithm is the fitness function $F_{\mathrm{cl/qm}}$, Eq. (102), which consists of two parts weighted by $f_{\mathrm{ent}}$ and $f_{\mathrm{bac}}$, respectively. The first part, $F^{\mathrm{ent}}_{\mathrm{cl/qm}}$, Eq. (103), is a measure for the average entropy of the probability distributions of the leaves of the tree. Its first sum iterates over the indices $\nu$ of all leaves of the tree $T$, Eq. (8), with the corresponding set of conditions $C_\nu$, Eq. (28). Here we introduce the information entropy [71]

$$ S[p(y)] \equiv -p(y = 0) \log_2 p(y = 0) - p(y = 1) \log_2 p(y = 1) \tag{104} $$

of a probability distribution $p(y)$ with support $\mathbb{B}$. The second part, Eq. (105), represents the mean balanced accuracy in accordance with Eq. (78). Both parts make use of abbreviations that are based on the probability distribution of reaching a certain node, for which we recall Eqs. (30) and (60), and on

$$ p_{\mathrm{cl/qm}}(y_i \mid C_\nu) \equiv \begin{cases} p(y_i \mid C_\nu) & \text{for the classical representation} \\ \hat{p}(y_i \mid C_\nu) & \text{for the quantum representation} \end{cases} \tag{112} $$

for the probability distribution of the $i$th label with $i \in \{1, \dots, m\}$, where we recall Eqs. (32) and (61), respectively. Finally, the prediction of the $i$th label, Eq. (113), is given according to Eqs. (40) and (64) for the classical and the quantum representation, respectively.

The information entropy, Eq. (104), attains its smallest value 0 for $p(y = 0) \in \{0, 1\}$ and its largest value 1 for $p(y = 0) = 1/2$. The choice $p(y) = p_{\mathrm{cl/qm}}(y_i \mid C_\nu)$ therefore leads to the following implication: Since a smaller information entropy corresponds to a less equal partitioning of data with different labels (in the sense of a smaller variance), it indicates a more favorable splitting decision with respect to its data partitioning capabilities.
In particular, the information entropy of two simultaneous events is no more than the sum of the information entropies of each individual event [72]. A splitting decision can therefore not increase the information entropy. This measure is closely related to the well-known information gain criterion [2] for the induction of decision trees. Maximization of $F^{\mathrm{ent}}_{\mathrm{cl/qm}}$ is equivalent to finding the minimum weighted average of the information entropy in all leaves over all label dimensions and can therefore be considered as a measure for the partitioning capabilities of all splitting decisions in the tree. An increased data partitioning is favorable since it represents a more conclusive tree.
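The behavior of the information entropy described above can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names and the `(p_reach, p0)` input format are hypothetical, and only the binary entropy of Eq. (104) and a reach-probability-weighted average over leaves are shown.

```python
import math


def binary_entropy(p0: float) -> float:
    """Information entropy S[p(y)] in bits for a binary label with p(y=0) = p0."""
    total = 0.0
    for p in (p0, 1.0 - p0):
        if p > 0.0:  # convention: 0 * log2(0) = 0
            total -= p * math.log2(p)
    return total


def weighted_leaf_entropy(leaves):
    """Average leaf entropy, weighted by the probability of reaching each leaf.

    `leaves` is a list of (p_reach, p0) pairs (hypothetical input format).
    """
    return sum(p_reach * binary_entropy(p0) for p_reach, p0 in leaves)


# A parent node with p(y=0) = 0.5 is split into two equally likely children
# with p(y=0) = 0.9 and p(y=0) = 0.1 (consistent: 0.5*0.9 + 0.5*0.1 = 0.5).
parent = binary_entropy(0.5)
children = weighted_leaf_entropy([(0.5, 0.9), (0.5, 0.1)])
```

In this toy example the weighted entropy of the children does not exceed the entropy of the parent, which illustrates why a lower average leaf entropy is read as a more conclusive tree.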
On the other hand, the balanced accuracy, Eq. (105), is a measure of how well the splitting decisions are able to explain the training data $D_{\mathrm{train}}$. In the fitness function, we combine the conclusiveness, Eq. (103), and the accuracy, Eq. (105), into a single metric for the quality of the splitting decisions in the tree (defined by $C$ and $D_{\mathrm{train}}$ or $b_1, \dots, b_N$). By tuning the (non-negative) weights $f_{\mathrm{ent}}$ and $f_{\mathrm{bac}}$, the influence of the two parts can be controlled for a specific use case. For the sake of convenience, one can also set $f_{\mathrm{bac}} = 1 - f_{\mathrm{ent}}$ to eliminate one hyperparameter.
Alg. 1 contains the following functions:
• Initalize(P′): P′ chromosomes are created by drawing them uniformly from an appropriate set of possible values, Eq. (17). The chromosomes are returned as a P′-tuple, where P′ ∈ {1, P}.
• Select(χ, t, P − 1, f_ent, f_bac): Selection of P − 1 chromosomes with the tournament selection method. Specifically, t chromosomes are drawn uniformly from χ and the one with the best fitness, Eq. (102), is selected. This process is repeated P − 1 times and the (P − 1)-tuple of selected chromosomes χ^gen is returned.

By keeping track of the fittest chromosome in $\chi^{\mathrm{best}} \in S^1$, we ensure that the best fitness in each generation cannot be worsened. This approach is also known as elitism [73]. The returned result of GeneticGrow(θ) is a chromosome that maximizes $F_{\mathrm{cl/qm}}$ in the sense of Eq. (19).
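Tournament selection combined with elitism can be sketched as follows. This is an illustrative toy version under stated assumptions, not the algorithm of Alg. 1: chromosomes are taken to be bit tuples, the tournament draws without replacement (so t must not exceed the population size), crossover is omitted, and the names `tournament_select` and `evolve` as well as the per-bit mutation operator are hypothetical.

```python
import random


def tournament_select(population, fitness, t, count, rng):
    """Draw t chromosomes uniformly, keep the fittest; repeat `count` times."""
    selected = []
    for _ in range(count):
        contenders = rng.sample(population, t)  # assumption: without replacement
        selected.append(max(contenders, key=fitness))
    return selected


def evolve(population, fitness, t, generations, p_m, rng):
    """Toy generational loop with elitism (crossover omitted for brevity)."""
    best = max(population, key=fitness)
    for _ in range(generations):
        # Track the fittest chromosome seen so far.
        best = max(population + [best], key=fitness)
        offspring = tournament_select(
            population, fitness, t, len(population) - 1, rng
        )
        # Hypothetical mutation operator: flip each bit with probability p_m.
        offspring = [
            tuple(bit ^ (rng.random() < p_m) for bit in chromosome)
            for chromosome in offspring
        ]
        # The elite chromosome re-enters the population unchanged.
        population = offspring + [best]
    return max(population, key=fitness)


rng = random.Random(0)
start = [(0, 0, 0), (1, 0, 1), (1, 1, 1), (0, 1, 0)]
winner = evolve(start, sum, t=2, generations=5, p_m=0.1, rng=rng)
```

Because the elite chromosome is carried over unchanged, the best fitness in the population can never decrease from one generation to the next, whatever the mutation does to the offspring.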
The evaluation of the fitness function $F_{\mathrm{cl/qm}}$ in Select(χ, t, P − 1, f_ent, f_bac) and Best(χ, f_ent, f_bac), respectively, implicitly depends on $D_{\mathrm{train}}$ and possibly $b_1, \dots, b_N$, depending on whether the classical or the quantum representation of the tree $T$ is considered, Eq. (110). In particular, $\hat{F}(b_1, \dots, b_N, D_{\mathrm{train}})$ represents a noisy fitness function. To handle this case, we perform an evaluation of $\hat{F}(b_1, \dots, b_N, D_{\mathrm{train}})$ and assign the fitness to the corresponding chromosome on its initialization or whenever it is affected by a crossover or a mutation. In addition, we average the fitness values over identical chromosomes in χ at the end of each iteration (i. e., after line 18) and re-assign the mean fitness to the associated chromosomes. By doing so, we can reduce the influence of statistical errors on the final outcome. The on-demand sampling proposed in Sec. 4.3 can also be used to reduce the classical memory required for the evaluation of the quantum fitness function.
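The averaging step over identical chromosomes can be illustrated with a short sketch. It assumes that chromosomes are hashable (here, tuples) and that each one already carries a single noisy fitness value; the function name `average_identical` is hypothetical.

```python
from collections import defaultdict


def average_identical(population, fitness_values):
    """Re-assign the mean fitness to all identical chromosomes.

    `population` and `fitness_values` are parallel lists; identical
    chromosomes share the mean of their individual noisy evaluations.
    """
    groups = defaultdict(list)
    for chromosome, value in zip(population, fitness_values):
        groups[chromosome].append(value)
    means = {c: sum(values) / len(values) for c, values in groups.items()}
    return [means[c] for c in population]


population = [(0, 1), (1, 1), (0, 1)]
noisy_fitness = [0.5, 0.9, 1.0]
smoothed = average_identical(population, noisy_fitness)
```

Here the two copies of `(0, 1)` end up with the same fitness, the mean of their two evaluations, while the unique chromosome keeps its own value.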
The algorithm presented here can be varied in many different ways, which may lead to better or worse results depending on the specific application at hand. For example, instead of having a constant maximum depth d, a variable maximum depth can be achieved by considering chromosomes of different length, which can each be assigned up to a specified maximum depth. Furthermore, instead of optimizing the tree as a whole, an iterative layer-by-layer optimization could be performed by optimizing only the decision configuration of the current layer from depth 0 to depth d − 1 (or up to a variable depth). Finally, the fitness function F cl/qm can be modified in various ways, for example by incorporating another common splitting criterion like the Gini index [2]. One could also consider a multi-objective genetic algorithm [21] with one fitness value for each of the two parts or one fitness value for each label (instead of performing the average over all labels i ∈ {1, . . . , m} in Eqs. (103) and (105)).
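As a small illustration of an alternative splitting criterion, the Gini index for a binary label can be written as follows. This sketch uses the standard textbook definition of the Gini impurity and is not taken from the paper; how it would enter the fitness function is left open here.

```python
def gini_impurity(p0: float) -> float:
    """Gini impurity of a binary label distribution with p(y=0) = p0."""
    p1 = 1.0 - p0
    return 1.0 - p0 * p0 - p1 * p1


pure = gini_impurity(0.0)    # all data share one label
mixed = gini_impurity(0.5)   # labels are evenly mixed
```

Like the information entropy, this quantity vanishes for a pure leaf and is largest for an evenly mixed one, so it could play the same role in a modified fitness function.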

C Q-tree measurements
In Secs. 4.1.1 and 4.1.2, we introduce the Q-tree data encoding and structure encoding, respectively. In the present section, we briefly show how the form of these encodings can be used to derive the probability distribution of the subsequent projective measurement, Eq. (56), which is important for the discussion in Sec. 4.1.2.

D Uncertainty
In the present appendix section, we summarize how we evaluate the statistical and hardware-induced uncertainties of probabilities estimated from fractions of measurement counts [74] using frequentist confidence intervals [75]. First, we consider the ratio of a binomial random variable to a constant number of trials, which allows us to determine confidence intervals for Eqs. (60) and (69), respectively. Second, to determine confidence intervals of Eqs. (61) and (63), we consider the case where the number of trials is also a binomial random variable. Finally, we use error propagation to estimate the uncertainty of Eq. (70). We conclude with a brief discussion of hardware-induced noise. A mathematical discussion of quantum sampling and its relation to classical sampling can be found, e. g., in [76].

First, presume a random variable $n \sim B(n; N, p)$ following the binomial distribution, Eq. (117), with $N \in \mathbb{N}$ independent Bernoulli trials of success probability $p \in [0, 1]$ such that $n \in \{0, \dots, N\}$ represents the drawn number of successes. Since the expectation value of the fraction of successes $n/N$ equals $p$, Eq. (118), the success probability for a finite set of trials can be estimated by the fraction of successes $\hat{p}(n, N)$, Eq. (119). Together with the corresponding standard deviation $\hat{\sigma}_p(n, N)$, Eq. (120), we use

$$ \hat{p}(n, N) \pm z_{\alpha/2}\, \hat{\sigma}_p(n, N) \tag{121} $$

as a confidence interval [77] for $p$ with target error rate $\alpha \in [0, 1]$. Here, $z_{\alpha/2}$ denotes the $1 - \alpha/2$ quantile of the standard normal distribution. In particular, Eq. (121) can be used to estimate the confidence intervals of Eqs. (60) and (69), respectively. For this purpose, we set $n \equiv n(C_\nu)$ and identify $N$ with the total number of measurements.

Second, we consider the estimation of a fraction $k/m$ of two random variables $k, m \sim D(k, m; N, p', q)$, where the probability distribution of jointly drawing $k$ and $m$, Eq. (122), is given by a product with $p' \in [0, 1]$, $q \in (0, 1]$, and $N \in \mathbb{N}$ such that $k \in \{0, \dots, m\}$ and $m \in \{1, \dots, N\}$, respectively. Here, we recall the binomial distribution, Eq. (117), and introduce the truncated binomial distribution, which considers only one or more successes (i. e., $m \geq 1$). The expectation values, Eq. (125), and the resulting estimates, Eq. (126), follow in analogy to Eqs. (118) and (119), respectively. We use the corresponding standard deviation, Eq. (128), together with the expectation value to obtain a confidence interval, Eq. (130), in analogy to Eq. (121). These expressions involve the generalized hypergeometric function, which is defined in terms of the Pochhammer symbol $(\cdot)_l$. Furthermore, we use $q \approx m/N$ according to Eq. (126) with $N \gg 1$. Eq. (130) can be used to estimate the confidence interval of Eqs. (61) and (63). For this purpose, we set $k \equiv n_i^{y_i}(C_\nu)$, $m \equiv n(C_\nu)$, and identify $N$ with the total number of measurements.
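For the first case (a binomial count over a constant number of trials), a confidence interval of the form "estimate ± quantile × standard deviation" can be sketched as below. The standard deviation is assumed here to take the usual plug-in form sqrt(p̂(1 − p̂)/N) of the normal-approximation (Wald) interval; the exact expression used in Eq. (120) is not reproduced in this section, so this form is an assumption. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def wald_interval(n: int, N: int, alpha: float = 0.05):
    """Normal-approximation confidence interval for a binomial proportion.

    n: number of successes, N: number of trials, alpha: target error rate.
    """
    p_hat = n / N  # fraction of successes
    # Assumed plug-in standard deviation of the fraction n / N.
    sigma_hat = math.sqrt(p_hat * (1.0 - p_hat) / N)
    # (1 - alpha/2) quantile of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return p_hat - z * sigma_hat, p_hat + z * sigma_hat


# Example: a node is reached in 300 out of 1000 measurements.
lower, upper = wald_interval(300, 1000)
```

The interval shrinks proportionally to 1/sqrt(N), which is why the statistical part of the uncertainty can be reduced by taking more measurements.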
Finally, we determine the uncertainty of Eq. (70) using standard error propagation (i. e., we neglect correlations and assume independent variables), which yields the corresponding standard deviation. In the resulting expression, $\hat{\sigma}_{p_q}(C_\nu)$ and $\hat{\sigma}_{p}(y_i \mid C_\nu)$ represent the standard deviations of the corresponding probabilities, Eqs. (69) and (61), which are given by Eqs. (120) and (128), respectively.

So far, we have only discussed statistical uncertainties. However, measurements on real quantum computers may also exhibit additional uncertainties from hardware imperfections. Such noise-induced uncertainties may, e. g., occur for Eqs. (63), (69) and (70). In general, noise-free measurement results from quantum computers are obtained by sampling from a probability distribution

$$ P(\rho, \hat{M}) = \mathrm{Tr}\left[ \hat{M}^\dagger \hat{M} \rho \right] \tag{133} $$

based on a density matrix $\rho$, which results from the application of gate operations on the initial state, and a measurement operator $\hat{M}$ (with Hermitian conjugate $\hat{M}^\dagger$). According to Eq. (56), the probabilities $p$, $p'$, and $q$ from Eqs. (117) and (122) can be identified with an expression of the form of Eq. (133) when applied to Eqs. (63), (69) and (70), respectively. When including noise from hardware imperfections [78], one can use a master equation approach [79] with the noise-perturbed density matrix $\rho'$, which can be expressed by [80]

$$ \rho' \approx \rho + \delta\rho \tag{135} $$

based on the approximated noise perturbation $\delta\rho$ such that

$$ P(\rho', \hat{M}) \approx P(\rho, \hat{M}) + \mathrm{Tr}\left[ \hat{M}^\dagger \hat{M} \delta\rho \right]. $$
This perturbation of the probability can also be included in Eqs. (117) and (122) if we treat the probabilities $p$, $p'$, and $q$ as random variables with a probability distribution determined by the hardware noise. Based on this presumption, we consider the generalizations of $B(n; N, p)$ and $D(k, m; N, p', q)$, Eqs. (137) and (138), respectively. Here we introduce a probability density for the noise-perturbed counterpart of $p$ with support $[0, 1]$ and a joint probability density for the noise-perturbed counterparts of $p'$ and $q$ with support $[0, 1]^2$ to quantify the hardware-induced uncertainty. Based on these presumptions, suitably chosen probability densities for $p$, $p'$, and $q$ can be used to model the noise (e. g., a truncated normal distribution for a heuristic approach or a distribution derived from the explicit form of $P(\rho', \hat{M})$ for a rigorous approach). Such a noise model may lead to a shift of expectation values, Eqs. (118) and (125), which corresponds to hardware-induced systematic errors, and an increase of the standard deviations, Eqs. (120) and (128), which corresponds to a hardware-induced uncertainty.
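The effect of such a noise model on the spread of an estimated probability can be illustrated with a small simulation. This is a heuristic sketch only: the success probability is drawn from a normal distribution truncated to [0, 1] (via rejection sampling) before each binomial experiment, and all names and parameter values are hypothetical choices for illustration, not values from the paper.

```python
import random


def truncated_normal(mean, sd, rng):
    """Rejection-sample a normal variate restricted to the interval [0, 1]."""
    while True:
        x = rng.gauss(mean, sd)
        if 0.0 <= x <= 1.0:
            return x


def sample_fraction(N, p, noise_sd, rng):
    """Draw n / N with n binomial; with noise_sd > 0, p itself is random."""
    p_used = truncated_normal(p, noise_sd, rng) if noise_sd > 0 else p
    n = sum(rng.random() < p_used for _ in range(N))
    return n / N


def variance(xs):
    """Unbiased sample variance."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


rng = random.Random(1)
ideal = [sample_fraction(200, 0.3, 0.0, rng) for _ in range(2000)]
noisy = [sample_fraction(200, 0.3, 0.05, rng) for _ in range(2000)]
var_ideal, var_noisy = variance(ideal), variance(noisy)
```

In this toy setting the noisy fractions scatter more widely than the noise-free ones, mirroring the increase of the standard deviation attributed to hardware-induced uncertainty. A noise density that is not centered at the ideal probability would additionally shift the mean.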
In the noise-free case, these probability densities can be expressed in terms of Dirac delta distributions such that Eqs. (137) and (138) reduce to Eqs. (117) and (122), respectively. This means that in this case the statistical uncertainty is negligible and the hardware-induced uncertainty (if present) is dominating.

E Random forests
Random forests represent a well-known predictive model composed of an ensemble of decision trees for which a prediction can be inferred from the (weighted) averaged predictions of the individual trees [67]. Our proposed quantum representation of binary classification trees can be generalized to such ensembles using additional qubits and gates. The connection between decision tree ensembles and quantum computation has already been explored in previous works. For example, a conceptual quantum forest model based on quantum amplitude amplification is presented in [81]. A general discussion of quantum ensembles of quantum classifiers can be found in [82]. Furthermore, a quantum-inspired (i. e., involving no actual quantum computation) feature subset generation method for ensemble methods is presented in [83]. Finally, a discussion of probabilistic random forests can be found in [28].
An ensemble of $f \in \mathbb{N}$ trees $T(1), \dots, T(f)$ of the same maximum depth $d$ can be characterized by the training data $D_{\mathrm{train}}$, Eq. (5), and the $f$-tuple

$$ F \equiv (C(1), \dots, C(f)), \tag{145} $$

which consists of the compressed decision configurations of all trees, Eq. (12). For the $t$th tree $T(t)$, the compressed decision configuration is denoted by $C(t)$ with $t \in \{1, \dots, f\}$. An additional probability distribution $p_{\mathrm{trees}}$ with support $\{1, \dots, f\}$ represents the relative importance of every tree in the forest by which its prediction is weighted.
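How the tree distribution weights the individual predictions can be illustrated classically. In this sketch, each tree is assumed to provide a probability for the label value 1 on a given query, and the forest prediction is the weighted average thresholded at 1/2; both the thresholding rule and the function name are assumptions made for illustration.

```python
def forest_predict(tree_probs, p_trees):
    """Weighted average of per-tree probabilities p(y=1 | x), then threshold.

    tree_probs: one probability per tree; p_trees: weights that sum to one.
    Returns the averaged probability and the predicted binary label.
    """
    if len(tree_probs) != len(p_trees):
        raise ValueError("one weight per tree is required")
    if abs(sum(p_trees) - 1.0) > 1e-9:
        raise ValueError("p_trees must be a probability distribution")
    p1 = sum(w * p for w, p in zip(p_trees, tree_probs))
    return p1, int(p1 > 0.5)  # assumed decision rule


# Three trees; the first one is considered twice as important as the others.
probability, label = forest_predict([0.9, 0.2, 0.6], [0.5, 0.25, 0.25])
```

With these numbers the first tree dominates the average, so the forest predicts the label 1 even though one tree disagrees.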
Figure 19: Q-Forest (circuit). Layout for our proposed quantum representation of a random forest consisting of $f$ trees $T(1), \dots, T(f)$ of maximum depth $d$, which is uniquely defined by the training data $D_{\mathrm{train}}$, the tuple of compressed decision configurations $F$, Eq. (145), and the probability distribution of the trees $p_{\mathrm{trees}}$. The Q-Forest is a generalization of the Q-tree, Fig. 3. Initially, all qubits are prepared in the ground state $|0\rangle$. First, the unitary operators $\hat{U}_{\mathrm{data}}(D_{\mathrm{train}})$ and $\hat{U}_{\mathrm{trees}}(p_{\mathrm{trees}})$ perform an encoding of the training data and the decision tree distribution, respectively, which leads to the state $|\Psi_{\mathrm{forest}}(p_{\mathrm{trees}}, D_{\mathrm{train}})\rangle$, Eq. (148). Second, $\hat{U}_{\mathrm{forest}}(F)$ performs an encoding of the decision configuration, which leads to the state $|\Psi_{\mathrm{forest}}(F, p_{\mathrm{trees}}, D_{\mathrm{train}})\rangle$, Eq. (149). Third, a projective measurement is performed to obtain a bit string, Eq. (155), consisting of features, labels, and ancillas, which are drawn from Eq. (156) such that the circuit can be considered a quantum representation of the random forest.
A quantum representation of such a forest can be realized in analogy to our Q-tree, Sec. 4.1. In addition to the $k + m$ qubits that represent the features and labels, the Q-forest circuit, which we also refer to as Q-forest for short, requires additional ancilla qubits (denoted by $|a_1\rangle, \dots, |a_f\rangle$ or their respective indices $k + m + 1, \dots, k + m + f$), which identify each tree. For the sake of simplicity, we assume that $f$ is chosen such that $f \geq 2$ and $\log_2 f \in \mathbb{N}$. We assume that all qubits are initially prepared in the ground state $|0\rangle$ in analogy to Eq. (41). The circuit consists of three parts: first, two unitary operators responsible for the data encoding; second, a unitary operator responsible for the structural encoding; and third, a projective measurement. The proposed Q-forest circuit layout is sketched in Fig. 19. In the first step, a qsample encoding of the training data is performed by means of $\hat{U}_{\mathrm{data}}(D_{\mathrm{train}})$ to obtain $|\Psi(F, D_{\mathrm{train}})\rangle$, Eq. (42). In analogy, the distribution of trees $p_{\mathrm{trees}}$ is stored with a qsample encoding by means of $\hat{U}_{\mathrm{trees}}(p_{\mathrm{trees}})$, which only acts on the ancilla qubits such that

$$ \hat{U}_{\mathrm{trees}}(p_{\mathrm{trees}}) |0\rangle = \sum_a \sqrt{p_{\mathrm{trees}}(a)}\, |a\rangle. $$
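The defining property of a qsample encoding, namely that each basis state carries the square root of its probability as amplitude, can be checked with a few lines of classical code. This sketch only computes the target amplitudes for a given tree distribution; it does not construct the gates of the encoding operator, and the function name and example distribution are hypothetical.

```python
import math


def qsample_amplitudes(p):
    """Return the amplitudes sqrt(p(a)) of a qsample state for a distribution p."""
    if any(x < 0 for x in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must be a probability distribution")
    return [math.sqrt(x) for x in p]


# Hypothetical distribution over f = 4 trees.
p_trees = [0.5, 0.25, 0.125, 0.125]
amplitudes = qsample_amplitudes(p_trees)

# Born rule: measuring the register yields tree a with probability |amplitude|^2.
measured = [a * a for a in amplitudes]
norm = sum(measured)
```

Squaring the amplitudes recovers the original distribution and the state is normalized, so a projective measurement of the ancilla register samples a tree according to its weight.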
In the second step, the structure of the forest is encoded using the unitary operator $\hat{U}_{\mathrm{forest}}(F)$, for which we recall Eq. (51) and the standard gates NOT, $C^v$SWAP (with $f \leq v \leq f + d$), and the identity $\mathbb{1}$, respectively, from Sec. 4.1.2. Due to Eq. (153), the structural information of each tree is additionally conditioned by the ancilla qubits. An example of this structural encoding is shown in Fig. 20.
To induce a forest, the individual trees are typically induced independently on a subsample of the data that is drawn randomly and with replacement from D train , a method also known as bootstrap aggregating or bagging. In addition, random subsets of features can optionally be selected for each tree, a method known as feature bagging. The method described in Sec. 4.2 can therefore be applied for the induction of the Q-forest in the sense of a Q-tree ensemble, where each Q-tree is trained using its own data set. To determine the distribution of trees p trees , the relative performance of the previously induced trees can be determined (e. g., using the prediction accuracy |x 1