EPISTEMIC ARTIFICIAL INTELLIGENCE: Using random sets to quantify uncertainty in machine learning

Shireen Kudukkil Manchingal, Muhammad Mubashar, Maryam Sultana, Salman Khan and Fabio Cuzzolin

School of Engineering, Computing and Mathematics, Institute for Artificial Intelligence, Data Analysis and Systems, Oxford Brookes University
[email protected]

Abstract. Quantifying uncertainty is fundamental to machine learning tasks, including classification and detection in complex domains such as computer vision (CV), and text generation in large language models (LLMs). This is especially crucial when artificial intelligence (AI) is used in safety-critical applications, such as autonomous driving or medical diagnosis, where reliable decisions are essential to prevent serious consequences. The Epistemic AI project explores the use of random sets for quantifying epistemic uncertainty in AI. A mathematical framework which generalizes the concept of random variable to sets, random sets enable a more flexible and expressive approach to uncertainty modeling. This work proposes ways to employ the random-set formalism to model classification uncertainty over both the target and parameter spaces of a machine learning model (e.g., a neural network), as well as detection uncertainty, within the context of computer vision. The applicability and effectiveness of random sets is also demonstrated in large language models, where they can be utilized to model uncertainty in natural language processing tasks. We show how, by leveraging random set theory, machine learning models can achieve enhanced robustness, interpretability and reliability while effectively modelling uncertainty.

Keywords: Uncertainty quantification, random sets, artificial intelligence, machine learning, neural networks, computer vision, natural language processing.

1 Introduction

Uncertainty quantification in machine learning is a crucial aspect that concerns the reliability and robustness of model predictions, particularly in complex real-world applications. It involves assessing and managing the uncertainties that arise from various sources, such as data quality, model parameters and the inherent variability in the phenomena being modeled.
There are generally two main types of uncertainty [21]: aleatoric uncertainty, which stems from the inherent noise and variability in the data itself, making it irreducible through additional information (e.g., sensor noise in measurements); and epistemic uncertainty [19], which arises from a lack of knowledge or information about the model or the system, and can be reduced by acquiring more data or improving the model (e.g., uncertainty about the best model architecture) [41, 43, 54]. Understanding and quantifying these uncertainties is essential for developing more reliable machine learning systems, as it enables better decision-making, improves model interpretability and enhances performance in safety-critical applications. Random sets have been studied by numerous authors [59, 63], and play a foundational role in Dempster-Shafer theory (the theory of belief functions) [76] by providing a mechanism for representing and reasoning about uncertainty when exact probabilities are unavailable or hard to determine. In Dempster-Shafer theory [22, 25], random sets are used to assign probabilities not to individual outcomes, but to sets of outcomes, capturing uncertainty about which specific outcome will occur. The rationale behind using random sets is that, for any observation or piece of evidence, the exact outcome is unknown, but a subset of possible outcomes can be identified. Each subset represents the possible outcomes that are consistent with the evidence. This assignment can be mathematically formalized in terms of belief functions (BFs) and plausibility functions (see Section 2), which, in turn, provide a flexible way to model uncertainty. In this paper, we explore the use of random sets and belief functions for quantifying uncertainty across various machine learning tasks, as part of the ongoing Horizon 2020 Epistemic AI project1. We first recall the basics of the theory of random sets in Sec.
2, and discuss the state of the art in terms of applications of random sets and belief functions to machine learning (Sec. 2.2). In Sec. 3, we introduce the concept of epistemic deep learning [54] as a framework for quantifying epistemic uncertainty in AI using random sets. Next, we explore how random sets can be leveraged to model classification uncertainty over both the target (output) space of a model (e.g., a deep neural network, Sec. 4) and its parameter space (e.g., the space of weights of a neural network, Sec. 5). Additionally, in Sec. 6, we demonstrate how random sets can be used for modelling uncertainty in computer vision applications, in particular object detection [44]. In Sec. 7, we consider the implications of random sets for quantifying uncertainty in large language models, a rapidly developing area of modern AI research. Finally, in Sec. 8 we conclude and discuss future work.

2 Random Sets

Random sets offer a versatile mathematical framework for representing and managing uncertainty, particularly when uncertainty pertains to a set of outcomes rather than individual points. A number of authors have made significant contributions to the theory of random sets over the years, including Nguyen, Goutsias and Smets, to cite a few [61, 38, 62, 78]. Ilya Molchanov's seminal work, Theory of Random Sets [59], in particular, has elegantly formalized this approach, which extends the classical concept of random variables to set-valued mappings. A random set X is a measurable mapping from a probability space (Ω, F, P) to the family F(R^d) of closed subsets of R^d:

X : Ω → F(R^d),

where X(ω) is a closed subset of R^d for each ω ∈ Ω. The distribution of a random set can be described using lower and upper probabilities. For a Borel set B ⊆ R^d, the lower probability is the chance that the random set is fully contained in B,

P_∗(B) = P(X ⊆ B),

while the upper probability is the probability that the random set intersects (hits) B,

P^∗(B) = P(X ∩ B ≠ ∅).
These capacities generalize traditional probability distributions and allow for more expressive modeling of uncertainty in set-valued outcomes. The expected value of a random set X, known as the Aumann expectation [6], extends the expectation of random variables to sets, and is defined as the set of expectations of the individual elements of X:

E[X] = {E[x] : x ∈ X(ω), ω ∈ Ω}.

In this paper, we utilize random sets within the framework of belief function theory [76, 62], both in the classical discrete case and on real numbers [79].

1 https://www.epistemic-ai.eu/

2.1 Random Sets and Belief Functions

Let us denote by Ω and Θ the sets of outcomes of two different but related problems Q1 and Q2, respectively. Given a probability measure P on Ω, we want to derive a 'degree of belief' Bel(A) that A ⊂ Θ contains the correct response to Q2. If we call Γ(ω) the subset of outcomes of Q2 compatible with ω ∈ Ω, ω tells us that the answer to Q2 is in A whenever Γ(ω) ⊂ A (see Figure 1). The degree of belief Bel(A) of an event A ⊂ Θ is then the total probability (in Ω) of all the outcomes ω of Q1 that satisfy the above condition [27]:

Bel(A) = P({ω | Γ(ω) ⊂ A}) = ∑_{ω∈Ω : Γ(ω)⊂A} P({ω}).   (1)

The map Γ : Ω → 2^Θ = {A ⊆ Θ} is called a multivalued mapping from Ω to Θ. Such a mapping, together with a probability measure P on Ω, induces a belief function on 2^Θ. In Dempster's original formulation [26], belief functions are objects induced by a source probability measure in a decision space for which we do not have a probability, as long as there exists a one-to-many mapping between the two.

Fig. 1. A multivalued mapping Γ linking a probability P on a source space Ω to a belief function Bel on the decision space Θ.

Belief functions can also be defined axiomatically on the domain of interest (frame of discernment) Θ, without making reference to multivalued mappings.
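As a toy illustration of Eq. (1), the following sketch computes the belief induced on Θ by a multivalued mapping Γ and a source probability P. The spaces, the mapping and all numerical values are invented for illustration and do not come from the paper.

```python
# Toy multivalued mapping Γ: Ω → 2^Θ and source probability P on Ω.
# Ω = {w1, w2, w3}, Θ = {a, b, c}; all values are invented.
Theta = frozenset({"a", "b", "c"})
P = {"w1": 0.5, "w2": 0.3, "w3": 0.2}      # probability measure on Ω
Gamma = {                                  # Γ(ω) ⊆ Θ
    "w1": frozenset({"a"}),
    "w2": frozenset({"a", "b"}),
    "w3": frozenset({"b", "c"}),
}

def bel(A):
    """Bel(A) = P({ω : Γ(ω) ⊆ A}), as in Eq. (1)."""
    A = frozenset(A)
    return sum(p for w, p in P.items() if Gamma[w] <= A)

print(bel({"a"}))        # only w1 qualifies → 0.5
print(bel({"a", "b"}))   # w1 and w2 → 0.8
print(bel(Theta))        # the whole frame always has belief 1
```

Note how Bel is not additive: Bel({a}) + Bel({b, c}) = 0.5 + 0.2 < 1, because the mass of w2 is committed to the set {a, b} as a whole.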
A basic probability assignment (BPA) [76] is a set function [28, 32] m : 2^Θ → [0, 1] such that m(∅) = 0 and ∑_{A⊆Θ} m(A) = 1. In Dempster's interpretation, the 'mass' m(A) assigned to A is in fact the probability P({ω ∈ Ω : Γ(ω) = A}). Shafer and Smets [77], amongst others, have supported a view of mass functions as independently defined on Θ. Subsets of Θ whose mass values are non-zero are called focal elements of m. The belief function (BF) associated with a BPA m : 2^Θ → [0, 1] is the set function Bel : 2^Θ → [0, 1] defined as:

Bel(A) = ∑_{B⊆A} m(B).   (2)

The corresponding plausibility function is Pl(A) = ∑_{B∩A≠∅} m(B) ≥ Bel(A). A further equivalent definition of belief functions can be provided in axiomatic terms [76]. Classical probability measures on Θ are a special case of belief functions (those assigning mass to singletons only), termed Bayesian belief functions [23, 18].

2.2 Random Sets in Machine Learning

Although limited in scope, some work has previously been done on the application of random sets to machine learning. In particular, in a texture classification framework [34], texture can be modeled as a random set of morphological features, allowing the classifier to account for the uncertainty inherent in natural textures. This is particularly useful in cases where images have noisy or uncertain texture boundaries, such as in medical imaging or satellite image analysis. By using random sets, the model can make more robust texture classifications by incorporating uncertainty directly into the feature extraction process. Another approach is the Random k-Labelsets method [87], which constructs classifiers by randomly sampling subsets of labels (k-labelsets) and training classifiers on these random sets.
The use of random sets in this context helps account for the uncertainty about which labels co-occur, making it an effective method for tasks like text categorization, image annotation or bioinformatics, where instances often belong to multiple classes simultaneously. By randomizing the selection of label subsets, the method is better able to generalize to unseen data and to handle ambiguous or uncertain label assignments.

Belief Functions. The theory of belief functions has previously been integrated into models to enhance their uncertainty estimation capabilities or to allow them to abstain from decision making. For instance, an evidential classifier [85] that leverages Dempster-Shafer theory within a deep learning context has recently been proposed. The model incorporates the calculation of mass functions into the final layers of a neural network and provides predictions over the power set of classes. Using a related credal representation, Zaffalon has proposed Naive Credal Classifiers (NCC) [102] as an extension of the naive Bayes classifier to credal sets, where imprecise data probabilities are included in models in the form of sets of classes. Antonucci [3], in turn, has proposed graphical models that generalize NCC to multilabel data. Expected utility maximization algorithms, such as finding the Bayes-optimal prediction [60], and approaches to classification with a reject option for risk aversion [64], have been proposed which are also based on set-valued predictions. Additionally, the ability to combine different sources of evidence using Dempster's rule of combination is particularly valuable for machine learning models tasked with high-stakes decision-making [30] in safety-critical applications like autonomous driving and medical diagnostics. As described earlier, belief functions and random sets offer a flexible approach to decision-making under uncertainty by assigning degrees of belief to various sets of hypotheses.
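To make the machinery of Sec. 2.1 concrete, the sketch below computes belief and plausibility (Eq. (2) and the subsequent definition) directly from a basic probability assignment. The focal sets and mass values are invented for illustration; any m with m(∅) = 0 and total mass 1 would do.

```python
# Bel and Pl computed from a toy basic probability assignment m on 2^Θ.
m = {
    frozenset({"a"}): 0.5,
    frozenset({"a", "b"}): 0.3,
    frozenset({"a", "b", "c"}): 0.2,   # mass on the whole frame = ignorance
}

def bel(A):
    """Bel(A) = Σ_{B ⊆ A} m(B), Eq. (2)."""
    A = frozenset(A)
    return sum(v for B, v in m.items() if B <= A)

def pl(A):
    """Pl(A) = Σ_{B ∩ A ≠ ∅} m(B); always Pl(A) ≥ Bel(A)."""
    A = frozenset(A)
    return sum(v for B, v in m.items() if B & A)

print(bel({"a"}), pl({"a"}))            # 0.5 1.0
print(bel({"b", "c"}), pl({"b", "c"}))  # 0.0 0.5
```

The gap Pl(A) − Bel(A) on each event reflects the mass committed to sets straddling A and its complement, i.e., the ambiguity left by the evidence.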
Evidential approaches. Within a proper epistemic setting, a significant amount of work has been done by Denoeux and co-authors, and by Liu et al. [35], on unsupervised learning, and clustering in particular, in the belief function framework. Much work has also been done on ensemble classification in the evidential framework [99] (in particular for neural networks [72]), on decision trees [33], on K-nearest neighbour classifiers [29] and, more recently, on evidential deep learning classifiers able to quantify uncertainty [85]. Tong et al. [85], more specifically, have proposed a convolutional neural network based on Dempster-Shafer theory, called the evidential deep classifier, which employs utility functions from decision theory to assign utilities to mass functions derived from input features, in order to produce set-valued observations. Another fairly recent approach by Sensoy et al. [75] proposes an evidential deep learning classifier to estimate second-order uncertainty in the Dirichlet representation. This work is based on subjective logic and learns to form subjective opinions by minimizing the Kullback-Leibler divergence from a uniform Dirichlet distribution. These methods represent predictions as Dirichlet distributions, as explored in various works over the last few years [75, 51–53, 12]. However, many commonly used loss functions for these networks are flawed, as they fail to ensure that epistemic uncertainty diminishes with more data, violating basic asymptotic assumptions [9]. Additionally, some approaches require out-of-distribution (OoD) data for training, which may not always be available and does not guarantee robustness against all types of OoD data. Studies [88] have shown that OoD detection degrades in certain models under adversarial conditions. Even techniques such as normalizing flows (NFs) in posterior networks [12], while effective, can struggle with OoD data when relying on learned features [45, 83].
For a much more extensive survey of the use of belief functions and random sets in machine learning, the reader is referred, for instance, to [19], Chapter 5.

3 Epistemic Deep Learning

The main aim of this paper is to introduce an epistemic deep learning concept based on random sets and belief functions. Our approach to developing an epistemic artificial intelligence [24] theory rests on the Socratic principle of 'learning from data the model cannot see', since the training data an AI model or network uses to learn how to solve a task is invariably a very small fraction of all the data theoretically available. In particular, we outline a research programme aimed at developing a new class of artificial neural networks [101] able to model epistemic learning in a random-set/belief-function framework. We call this approach 'epistemic deep learning', and argue that a deep neural network [47] producing an outcome for (selected) subsets of the target space, or learning from data a random-set representation on its parameter space, is an intrinsically more faithful representation of the epistemic uncertainty associated with the limited quantity and quality of the training data. We first illustrate the principles of epistemic deep learning (Sec. 3.1). A key step towards this is re-defining the ground truth of observations over the set of all subsets of the relevant target space (i.e., its power set). We then briefly recall the main features of deep neural networks (Sec. 3.2), before making a distinction between uncertainty representations in the target space as opposed to the parameter space of a network (Sec. 3.3).
3.1 Epistemic Artificial Intelligence

Fig. 2. Epistemic AI's notion of learning (b), as opposed to that of traditional machine learning/artificial intelligence (a): classical AI learns a single model explaining the observed training data (plus some generalisation), whereas epistemic AI starts from a state of ignorance (all models possible) and learns the set of models compatible with the training data, accounting for unobserved relevant data.

The principle of epistemic artificial intelligence is illustrated in Figure 2. While traditional machine learning learns from the (limited) available evidence a model able to describe it, with limited power of generalisation (a), epistemic AI (b) starts by assuming that the task at hand is (almost) completely unknown, because of the sheer imbalance between what we know and what we do not know. Our ignorance is then only tempered in the light of the (limited) available evidence, to avoid forgetting how much we ignore about the problem or, in fact, that we even ignore how much we ignore. This principle translates into seeking to learn sets of hypotheses compatible with the (scarce) data available, rather than individual models. Given new data, a set of models may provide a robust set of predictions, among which the most cautious one can be adopted. Mathematically, this can be done by modelling epistemic uncertainty via a suitable uncertainty measure on the target space Θ for the task at hand (e.g., the list of classes C for classification [46], or a subset of R^n for regression problems [56]). In particular, in this paper we focus on random-set representations.

3.2 Deep Neural Networks

Artificial Neural Networks (ANNs) are computational models inspired by the biological neural networks that constitute animal and human brains. ANNs are typically composed of layers of artificial neurons, taking a number of inputs x1, . . .
, xn ∈ X belonging to an input space X and producing an output y ∈ Y belonging to an output (or target) space Y. This mapping is the result of applying a non-linear activation function f to a weighted linear combination of the inputs x1, . . . , xn, each multiplied by a weight factor:

y = f(∑_{i=1}^{n} w_i x_i).

Commonly used are the sigmoid f(x) = 1/(1 + e^{−x}) and the rectified linear unit (ReLU) f(x) = max(0, x) activation functions. Artificial neurons can be arranged into multiple layers to form a network. The simplest example of such an architecture is the seminal multilayer perceptron [69]. In the past 12 years, deep neural networks (DNNs) [84], composed of a very large number of layers and employing clever weight sharing or (more recently) attention architectures [81], have come to dominate the field of machine learning. The set of weights and biases of a (deep) neural network is called its parameter space. Deep neural networks are trained in a supervised fashion (i.e., by showing the network a number of examples of input vectors together with the corresponding target values) by minimising a suitable loss function, whose analytical form depends on the task one intends to solve. In practice, this is done numerically using backpropagation [98], where the weights of all the connections in the network are updated backward, layer by layer, starting from the last one, by calculating the gradient (the vector of derivatives) of the error, i.e., the discrepancy between the output produced by the network and the desired one (which, in a supervised setting, is known for the training data points).
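The single-neuron computation y = f(∑_i w_i x_i) can be sketched in a few lines; the input and weight values below are arbitrary illustrations.

```python
import math

def neuron(xs, ws, f):
    """A single artificial neuron: y = f(Σ_i w_i x_i)."""
    return f(sum(w * x for w, x in zip(ws, xs)))

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def relu(t):
    return max(0.0, t)

xs = [1.0, -2.0, 0.5]   # inputs (illustrative values)
ws = [0.3, 0.1, 0.8]    # weights

print(neuron(xs, ws, relu))     # weighted sum ≈ 0.5, ReLU passes it through
print(neuron(xs, ws, sigmoid))  # ≈ 0.62, squashed into (0, 1)
```

Stacking layers of such units, with the output of one layer feeding the next, yields the multilayer architectures discussed above.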
3.3 Target-level vs Parameter-level Representation

While the rationale of epistemic deep learning (and of epistemic AI more in general) is to model epistemic uncertainty using a suitable uncertainty measure, in practice this can happen at two distinct levels:

– A target level, where the neural network is designed to output an uncertainty measure defined on the target space at hand, but its parameters (weights) are deterministic.
– A parameter level, in which the network is modelled by an uncertainty measure on the parameter space itself (i.e., its set of weights).

The two modelling levels are reflected in their training objectives. At target level, a loss function needs to be defined to measure the difference between predictions and ground truth, both in the form of uncertainty measures on the target space. However, this loss is a function of a deterministic weight assignment. In other words, target-level epistemic learning generalises traditional deep networks, which output point predictions, to networks which output set-valued ones. At parameter level, the objective is to recover an uncertainty measure defined on the set of weights or, more generally, on the hypothesis space for that class of models. For a given weight configuration, we can assume that the output is of the same kind as in a classical deep network: for instance, for classification, a set of scores that can be calibrated to a probability distribution over the target space. As a consequence, parameter-level epistemic networks will have the same output layers as traditional networks, but will differ in the training process. Target-level networks, instead, will have different output layers, designed to output a more general uncertainty measure than a classical distribution. In what follows, we first focus on the target-level representation (see Sec. 4), and then consider the parameter-level representation (see Sec. 5).
4 Random Sets for Classification with Target Space Uncertainty

Building on the principles of epistemic deep learning, we design a novel class of neural networks called Random-Set Convolutional Neural Networks (RS-CNNs) [55], which can be trained to output scores for sets of outcomes, and thus to encode the epistemic uncertainty associated with a prediction in the random-set framework. In a standard classification task, where each input is classified into one of several mutually exclusive classes, the classical probability framework assumes that each class is distinct and independent. Typically, a softmax model [5], implemented in the final layers of the network, provides the probabilities for each class, and these probabilities sum to 1. This approach works well when one is confident about the classification, i.e., when there is sufficient evidence to assign a clear probability to each class. However, it does not capture the uncertainty or ambiguity that exists when the evidence (as provided by the training set) is insufficient (e.g., because similar points were not sampled at training time, or because the test data is affected by distribution shift). Relying solely on confidence measures (such as classical softmax probabilities) is insufficient for a comprehensive analysis of model predictions: this is a problem known in machine learning as calibration [8]. Our approach uses random sets, as detailed in Sec. 2, to offer a robust framework for assessing prediction reliability.

4.1 Random-set Convolutional Neural Network

Consider the following: a classifier e (e.g., a neural network) is a mapping from an input space X to a target space Y = C (the set of classes), i.e., e : X → C. In our set-valued classification setting, on the other hand, e is a mapping from X to the set of all subsets of C, the power set P(C), namely: e : X → P(C). As shown in Fig.
3 (b), the RS-CNN predicts for each input data point a belief function, rather than a vector of softmax probabilities as in a traditional CNN. For N classes, a 'vanilla' RS-CNN would have 2^N outputs (as 2^N is the cardinality of P(C)), each being the belief value of the focal set of classes A ∈ P(C) corresponding to that output neuron. Given a training data point with an attached true class, its ground truth is encoded by the vector bel = {Bel(A), A ∈ P(C)} of belief values for each focal set of classes A ∈ P(C), where Bel(A) is set to 1 iff the true class is contained in the subset A, and to 0 otherwise. This corresponds to full certainty that the element belongs to that set, with complete confidence in this proposition. However, a major limitation curbs the use of such vanilla networks in practice: the number of subsets increases exponentially with the number of classes, and very quickly becomes astronomical. While some circumvent this problem by limiting the number of focal sets via a threshold on cardinality [31], others ignore it altogether and only work with small datasets [75]. Here we propose a strategy, based on traditional 'hard' clustering, for budgeting the number of focal sets and selecting only the most useful ones given the available training data.

4.2 Budgeting

In more detail, to overcome the exponential complexity of using 2^N sets of classes (especially for large N), a fixed budget of K relevant non-singleton focal sets (of cardinality > 1) is used, as shown in Fig. 3 (a). These focal sets are obtained by clustering the original classes C and selecting the top K focal sets of classes with the highest overlap ratio, computed as the intersection-over-union ratio of the class regions A_c, c ∈ A, for each subset A in P(C):

overlap(A) = (⋂_{c∈A} A_c) / (⋃_{c∈A} A_c),   A1, . . . , AK ∈ P(C).   (3)

The clustering is performed on feature vectors of images of each class, generated by a standard CNN trained on the original classes C.
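The overlap-based selection of Eq. (3) can be sketched as follows. This is a deliberately simplified stand-in: each class region A_c is represented as a discrete set of feature-space cells rather than a fitted GMM ellipsoid, so intersection-over-union reduces to plain set operations; the class names and cell indices are invented for illustration.

```python
from itertools import combinations

# Toy class regions A_c, as sets of discretised feature-space cells.
regions = {
    "car":   {1, 2, 3, 4},
    "truck": {3, 4, 5},
    "cat":   {8, 9},
    "dog":   {9, 10},
}

def overlap(A):
    """Eq. (3): intersection over union of the class regions A_c, c ∈ A."""
    inter = set.intersection(*(regions[c] for c in A))
    union = set.union(*(regions[c] for c in A))
    return len(inter) / len(union)

# Rank all non-singleton subsets and keep the K most-overlapping ones.
K = 2
candidates = [set(A) for r in range(2, len(regions) + 1)
              for A in combinations(regions, r)]
budget = sorted(candidates, key=overlap, reverse=True)[:K]
print(budget)   # the two pairs with highest IoU: {car, truck} and {cat, dog}
```

Here visually confusable classes (car/truck, cat/dog) end up as the budgeted focal sets, which matches the intuition that mass should be reserved for sets of classes the features cannot separate.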
The feature vectors are further reduced to 3 dimensions using t-SNE (t-Distributed Stochastic Neighbor Embedding) [89] before a Gaussian Mixture Model (GMM) is fitted to them. Ellipsoids [82] covering 95% of the data are generated using the eigenvectors and eigenvalues of the covariance matrix Σc and the mean vector µc, ∀c ∈ C, of the component Pc ∼ N(xc; µc, Σc) obtained from the GMM, and are used to calculate the overlaps. To avoid computing a degree of overlap for all 2^N subsets, the algorithm is stopped early when increasing the cardinality does not alter the list of most-overlapping sets of classes. The K non-singleton focal sets so obtained, along with the N original (singleton) classes, form our network outputs O = C ∪ {A1, . . . , AK}.

Fig. 3. RS-CNN model architecture. (a) Budgeting: given a collection C of N classes, a GMM is fitted to the feature vectors of each class, and the top K relevant (focal) sets of classes {A1, . . . , AK} (e.g., {car, truck}, {car, truck, vehicle}) are selected from the power set P(C) via cluster overlap and added to the singleton classes to form the budget O. (b) Training and Inference: the ground-truth class of each training data point is encoded as a belief vector bel, and the output layers (in grey) are trained to produce a predicted belief function b̂el by minimising the RS-CNN loss function. Mass values m̂ and pignistic probability estimates BetP are computed from the predicted belief function. Uncertainty is estimated as described in Sec. 4.4.

4.3 Training

Whereas standard approaches encode the (true) class of a training data point as a 'one-hot' probability vector (i.e., assigning probability 1 to the true class, and 0 to all the others), the RS-CNN encodes it using a belief vector bel = [Bel(A), A ∈ O], where Bel(A)
is the belief value for a focal set A in the power set P(C) of the classes C. In our method, Bel(A) is set to 1 if the true class is in subset A, and to 0 otherwise. Consequently, the belief-encoded ground truth bel will include multiple occurrences of 1. For instance, in a digit-classification task such as that posed by the MNIST dataset2, if {3} is the true class, the belief-encoded ground truth would have a 1 in correspondence with each set containing {3}, such as {1, 3}, {0, 3}, {1, 2, 3} and so forth. This formulation closely resembles multi-label classification. Since we assign binary labels to the beliefs of the different focal sets, our model penalizes predictions that differ from the observed label in the same manner, regardless of the set's composition or relationship to the true label. Using MNIST as an example, this means that the loss incurred when predicting label {3} is equivalent to the loss incurred when predicting label {3, 7}, as both sets contain the true label. This equivalence in losses might seem counterintuitive at first glance. However, despite the identical loss values, the probabilities output by the sigmoid activation function will vary due to differences in the input logit values for each label. For instance, if the correct label is {3}, the loss for predicting {3} or {3, 7} would be the same; still, the loss for predicting {7} would differ, allowing the model to discern the set structure associated with a belief function during training. During the training process, the model learns to capture and understand these relationships by observing patterns and dependencies in the training data. As the model optimizes its parameters based on the training objective (by minimizing a suitable loss function, described below), it gradually adjusts its internal representations to better reflect these relationships.
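The belief-vector encoding just described can be sketched directly; the budget O below (four singleton classes plus three invented non-singleton sets) is a toy illustration, not the budget used in the paper's experiments.

```python
# Belief-vector encoding of the ground truth over a toy budget O.
O = ([frozenset({c}) for c in range(4)]
     + [frozenset({1, 3}), frozenset({0, 3}), frozenset({1, 2, 3})])

def encode(true_class):
    """Bel(A) = 1 iff the true class is contained in A, else 0."""
    return [1.0 if true_class in A else 0.0 for A in O]

print(encode(3))   # [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
```

As the printed vector shows, true class 3 switches on not only its singleton {3} but every budgeted set containing 3, which is exactly the multi-label-like target discussed above.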
We use as the basis for our loss function the classical binary cross-entropy loss with sigmoid activation [55]:

L_BCE = − (1 / b_size) ∑_{i=1}^{b_size} (1 / |O|) ∑_{A∈O} [ Bel_i(A) log(B̂el_i(A)) + (1 − Bel_i(A)) log(1 − B̂el_i(A)) ].   (4)

Here, i is the index of the training point within a batch (a small group of training data points) of cardinality b_size, A is a focal set of classes in the budget O, Bel_i(A) is the A-th component of the vector bel_i encoding the ground-truth belief values for the i-th training point, and B̂el_i(A) is the corresponding belief value in the predicted vector b̂el_i for the same training point. Both bel_i and b̂el_i are vectors of cardinality |O| for all i. This loss allows the model to predict the presence or absence of each label separately, as a binary classification problem, producing probabilities between 0 and 1 for each class. While the model is unaware that it is learning for sets of outcomes, we leverage this technique to extract mass functions and pignistic predictions from the learnt belief functions. Mass functions can be recovered from belief functions via Moebius inversion [76]. Additional regularization terms are employed in the loss to ensure that the network predicts valid belief functions, whose masses are non-negative and normalized [55].

2 https://www.kaggle.com/datasets/hojjatk/mnist-dataset

4.4 Prediction and Uncertainty Quantification

Pignistic Prediction. Given the predicted belief function b̂el, we can compute a classical probabilistic prediction by computing its pignistic probability [80], i.e., the precise probability distribution obtained by re-distributing the mass of each focal set A to its constituent elements c ∈ A:

BetP(c) = ∑_{A∋c} m(A) / |A|.   (5)

Empirical results [55] show that such pignistic predictions are consistently more accurate not only than those of standard CNN models, but also than those of the most recent Bayesian deep networks, such as LB-BNN [39], epistemic neural nets [67] and FSVI [74].
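As an illustration of the pignistic transform of Eq. (5), the sketch below redistributes each focal set's mass uniformly over its elements. The mass function is invented for illustration and is not taken from the paper's experiments.

```python
# Pignistic transform, Eq. (5): split each focal set's mass equally
# among its elements. Toy mass function over three classes.
m = {
    frozenset({"horse"}): 0.7,
    frozenset({"horse", "deer"}): 0.2,
    frozenset({"horse", "deer", "dog"}): 0.1,
}
classes = ("horse", "deer", "dog")

def betp(c):
    return sum(v / len(A) for A, v in m.items() if c in A)

probs = {c: betp(c) for c in classes}
print(probs)   # horse ≈ 0.833, deer ≈ 0.133, dog ≈ 0.033; sums to 1
```

The result is a bona fide probability distribution, so any downstream machinery expecting precise probabilities (argmax decisions, Shannon entropy, calibration curves) can be applied to it unchanged.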
Entropy of the Pignistic Prediction. The Shannon entropy of the pignistic probability estimate BetP can then be used to assess the uncertainty associated with the predictions:

H_RS = − ∑_{c∈C} BetP(c) log BetP(c).   (6)

A higher entropy value (Eq. 6) indicates greater uncertainty in the model's predictions.

Size of the Credal Prediction. A random-set CNN predicts a belief function (a finite random set) on the target space. A belief function, in turn, is associated with a convex set of probability distributions (a credal set [48, 103, 17, 4]) on the space of classes. The size of this credal set thus measures the extent of the epistemic uncertainty associated with the prediction. When the model is confident, the credal set will approach a single, precise probability; when uncertainty is high, the credal set will be wide. Any probability distribution P such that P(A) ≥ Bel(A), ∀A ⊂ Θ, is said to be consistent with Bel. Each belief function Bel thus uniquely identifies a set of probabilities consistent with it, P[Bel] = {P ∈ P | P(A) ≥ Bel(A)}, where P is the set of all probabilities one can define on Θ. Not all credal sets are consistent with a belief function. Credal sets associated with BFs have as vertices the distributions P^π induced by the permutations π = {θ_π(1), . . . , θ_π(|Θ|)} of the singletons of Θ = {θ1, . . . , θn}, of the form [13, 16]:

P^π[Bel](θ_π(i)) = ∑_{A∋θ_π(i); A∌θ_π(j) ∀j<i} m(A).   (7)

Such an extremal probability (7) assigns to the singleton element put in position π(i) by the permutation π the mass of all the focal elements containing it, but not containing any element preceding it in the permutation order [90]. Eq. (7) analytically derives the finite set of vertices which identify a random-set prediction. For a class c, the size of the credal set can be approximated by calculating the minimum and maximum (Eq.
8) of the extremal probabilities:

\[
P^\pi_{min} = \min_\pi P_\pi[Bel](\theta_{j_c}), \qquad P^\pi_{max} = \max_\pi P_\pi[Bel](\theta_{j_c}), \tag{8}
\]

where j_c is the index of class c. These minimum and maximum extremal probabilities are the lower and upper bounds of the estimated probability for class c. The predicted pignistic probability estimate (5) falls within the interval defined by these bounds, whose width indicates the epistemic uncertainty associated with the prediction. Credal set width, measured as the difference between \(P^\pi_{max}\) and \(P^\pi_{min}\), thus represents the epistemic uncertainty estimate associated with an RS-CNN prediction.

4.5 Experimental Results

We show in [55] how RS-CNN outperforms state-of-the-art uncertainty estimation models and the standard CNN in both accuracy and out-of-distribution (OoD) tests, provides reliable uncertainty quantification, is robust to noisy samples, and scales seamlessly to large-scale architectures and datasets. We compare RS-CNN with the standard CNN, which lacks uncertainty estimation entirely, to demonstrate that RS-CNN achieves comparable, if not superior, performance while also offering reliable uncertainty estimates and superior out-of-distribution performance. Given such results, our approach exhibits significant potential for safety-critical applications such as medical diagnostics and autonomous driving, where uncertainty estimation and OoD detection are crucial. Learning from sets enhances informativeness, making our approach particularly effective for real-world image classification and analysis tasks, especially when handling imprecise data for which precise labels may not be available or appropriate.
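The credal-set vertices of Eq. (7) and the lower/upper bounds of Eq. (8) can be sketched by brute-force enumeration of permutations; the mass function below is a toy example, and exhaustive enumeration is of course practical only for small frames:

```python
from itertools import permutations

def credal_vertices(masses, frame):
    """Eq. (7): each permutation pi of the frame induces a vertex P_pi of the
    credal set; the element in position i collects the mass of every focal set
    containing it but none of the elements preceding it in the permutation."""
    vertices = []
    for pi in permutations(sorted(frame)):
        P = {}
        for i, theta in enumerate(pi):
            preceding = set(pi[:i])
            P[theta] = sum(m for A, m in masses.items()
                           if theta in A and not (A & preceding))
        vertices.append(P)
    return vertices

def credal_interval(masses, frame, c):
    """Eq. (8): lower/upper probability of class c over all credal vertices;
    their difference is the credal set width for c."""
    probs = [P[c] for P in credal_vertices(masses, frame)]
    return min(probs), max(probs)

# Toy mass function on a three-class frame (illustrative values only).
masses = {frozenset({"cat"}): 0.5, frozenset({"dog"}): 0.1,
          frozenset({"cat", "dog"}): 0.2,
          frozenset({"cat", "dog", "bird"}): 0.2}
lo, hi = credal_interval(masses, {"cat", "dog", "bird"}, "cat")
width = hi - lo   # epistemic uncertainty estimate for class "cat"
# lo = Bel({cat}) = 0.5 and hi = Pl({cat}) = 0.9, so width = 0.4
```

Every vertex produced this way is a valid probability distribution, and the interval [lo, hi] for a singleton class coincides with its belief and plausibility values.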
Here, we simply show some qualitative results and discuss uncertainty estimation. Table 1, in particular, shows belief functions, masses and pignistic RS-CNN predictions for a couple of samples from the CIFAR-10 image classification dataset3.

Table 1. The predicted belief, mass values, pignistic probabilities and entropy for two CIFAR-10 predictions. The figure on the top (True Label = ‘horse’) is a certain prediction with 99.9% confidence and a low entropy of 0.0017, whereas the figure on the bottom (True Label = ‘cat’) is an uncertain prediction with 33.3% confidence and a higher entropy of 2.6955.

Sample 1 (True Label = ‘horse’):
Belief: Bel({’horse+’, ’bird+’}) = 0.9999402; Bel({’horse+’, ’dog+’}) = 0.9999225; Bel({’horse+’}) = 0.9999175; Bel({’horse+’, ’deer+’}) = 0.9998697; Bel({’cat+’, ’truck+’}) = 7.0380207e-05.
Mass: m({’horse+’}) = 0.9999175; m({’cat+’, ’truck+’}) = 6.859753e-05; m({’ship+’, ’bird+’}) = 4.094290e-05; m({’horse+’, ’bird+’}) = 2.250525e-05; m({’dog+’}) = 1.717869e-05.
Pignistic: horse 0.9998833; truck 3.5826409e-05; cat 3.3859180e-05; bird 3.3015738e-05; ship 2.3060647e-05. Entropy: 0.0017040148.

Sample 2 (True Label = ‘cat’):
Belief: Bel({’deer+’, ’cat+’}) = 0.4787728; Bel({’deer+’, ’airplane+’, ’bird+’}) = 0.4126398; Bel({’horse+’, ’deer+’}) = 0.3732957; Bel({’deer+’, ’bird+’}) = 0.3658997; Bel({’deer+’, ’dog+’}) = 0.3651531.
Mass: m({’deer+’}) = 0.3104962; m({’cat+’, ’truck+’}) = 0.1762222; m({’dog+’, ’bird+’}) = 0.0998060; m({’horse+’, ’bird+’}) = 0.0954350; m({’bird+’}) = 0.0524873.
Pignistic: deer 0.3332411; cat 0.2230723; horse 0.1153417; bird 0.1086245; dog 0.1039505. Entropy: 2.6955228261.

3 https://www.cs.toronto.edu/~kriz/cifar.html

In the top figure, corresponding to the true label ‘horse’, the model makes a highly confident prediction with 99.9% confidence and a low entropy of 0.0017. Conversely, the bottom figure, associated with the true label ‘cat’, represents an uncertain prediction with 33.3% confidence and a higher entropy of 2.6955. The reader will note that the second image is slightly unclear and poor in quality. Figs. 4 and 5 depict the relationship between the entropy of the RS-CNN pignistic prediction and the associated confidence level (Fig.
4), and between credal set width and confidence (Fig. 5), respectively, for the CIFAR-10 (left, assumed to model in-distribution (iD) data) and the SVHN4 and Intel Image5 (right) datasets, assumed to model out-of-distribution (OoD) data. The distribution of iD predictions (left) in both tests shows a concentration at the top left, indicating high confidence and low entropy or credal set width. Conversely, OoD predictions (middle, right) exhibit a more dispersed pattern.

Fig. 4. Entropy vs Confidence score on iD (left) vs OoD (right) datasets. For CIFAR-10, most predictions are concentrated at the top left of the plot, indicating lower entropy and higher confidence in the predictions. For the SVHN and Intel Image datasets, predictions are more distributed.

Fig. 5. Credal Set Width vs Confidence score on iD (left) vs OoD (right) datasets. For CIFAR-10, confidence scores are high and credal set width is small. For the SVHN and Intel Image datasets, credal set width varies for each prediction and is less reliant on the confidence score.

4 http://ufldl.stanford.edu/housenumbers/
5 https://www.kaggle.com/datasets/puneet6060/intel-image-classification

Entropy, reflecting prediction uncertainty, is quite correlated with confidence in both the iD and OoD tests.
In contrast, as it considers the entire set of plausible outcomes within a belief function rather than a single prediction, credal set width better quantifies the degree of epistemic uncertainty inherent to a prediction. As a result, credal set width is less dependent on the concentration of predictions and is more reflective of the overall uncertainty encompassed by the model: Fig. 5 shows that, unlike entropy, credal set width is clearly not correlated with confidence.

5 Random Sets for Classification with Parameter Space Uncertainty

Whereas RS-CNN is an approach to outputting belief functions/random sets in the target space of a neural network, here we propose a method for learning, from the training data, a random set on the space of network parameters. We term this approach the random-set wrapper. The approach, depicted in Figure 6, uses Bayesian neural networks (BNNs) [91] as baselines and transforms their learnt posteriors over the weights of a neural network into belief posteriors there, without requiring additional training. This method can effectively capture epistemic uncertainty, offering a robust, efficient and generic methodology for uncertainty quantification which generalizes the Bayesian approach to deep learning [96].

5.1 Learning Bayesian Weight Posteriors

The first step is to obtain weight posteriors from a baseline Bayesian Neural Network (BNN) model [42]. In BNNs, the network’s parameters, such as weights and biases, are modelled as probability distributions. By setting priors over these parameters, represented as ω, the goal of training a BNN is to learn the posteriors p(ω|D) using the training data D, through Bayes’ theorem:

\[
p(\omega|D) = p(D|\omega) \, p(\omega) / p(D), \tag{9}
\]

where p(ω), p(D), and p(D|ω) denote the prior, evidence and likelihood distributions, respectively.
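On a discretised one-dimensional parameter space, the Bayesian update of Eq. (9) can be sketched directly; the toy regression model, noise level and grid below are all illustrative assumptions rather than part of the wrapper itself:

```python
import numpy as np

# Hypothetical 1-D "network parameter" omega: the data are generated by a toy
# model y = omega * x + noise with omega = 1.5 (all choices are illustrative).
rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 1.5 * x + 0.1 * rng.normal(size=20)

omega_grid = np.linspace(-3.0, 3.0, 601)     # discretised parameter space
dw = omega_grid[1] - omega_grid[0]
prior = np.exp(-0.5 * omega_grid ** 2)       # p(omega) ~ N(0, 1), unnormalised
# log p(D | omega): Gaussian residual log-likelihood for each candidate omega
log_lik = np.array([-0.5 * np.sum((y - w * x) ** 2) / 0.1 ** 2
                    for w in omega_grid])
numerator = prior * np.exp(log_lik - log_lik.max())   # stabilised p(D|omega) p(omega)
posterior = numerator / (numerator.sum() * dw)        # divide by the evidence p(D)

omega_map = omega_grid[np.argmax(posterior)]          # concentrates near 1.5
```

In a real BNN the posterior over the full weight vector is intractable and is approximated variationally or by sampling; the grid version above only illustrates the shape of the update.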
Due to the computational complexity of marginalizing the likelihood p(D|ω) over ω when making predictions with BNNs on a test instance x, Bayesian model averaging (BMA) is commonly used for inference [36]:

\[
y_{bnn} = h_{bnn}(x) \approx \frac{1}{N_p} \sum_{i=1}^{N_p} h^{\omega_i}_{bnn}(x) \in \mathcal{P}(T), \tag{10}
\]

where N_p is the number of samples used to approximate the posterior distribution of the parameters p(ω|D) during inference. The term \(h^{\omega_i}_{bnn}\) refers to the deterministic model parameterized by ω_i, sampled from the posterior distribution p(ω|D).

Fig. 6. Our proposed random-set wrapper transforms posterior distributions on the network weights, as learnt by a Bayesian Neural Network, into belief posteriors there through a five-step process. It involves extracting probability posteriors, calculating belief values over Borel intervals, computing mass values using Moebius inversion, fitting a Dirichlet distribution to these masses via Maximum Likelihood Estimation, and using the resulting belief posteriors as weights to Interval Neural Networks (INNs) for final predictions.

5.2 Continuous Belief Functions on Borel Intervals

In the second step, belief values are calculated over Borel (closed) intervals of the real line within each of the posterior distributions. These intervals, denoted [a, b], where a and b are the extremes, can efficiently represent subsets of a continuous sample space, making them useful for defining the support of belief functions under epistemic uncertainty [19]. The theory of belief functions on real numbers, which leverages a representation on closed intervals, is amply discussed in [79]. Figure 7 illustrates how belief functions can be represented on intervals of the real line (in particular, those included in [0, 1]).
In our case, for any given posterior distribution over network weights learned by a BNN, we extract the posterior distributions of these weights and construct belief values over Borel intervals (see Figure 6).

Fig. 7. Smets’ and Strat’s representation of belief functions on intervals. Left: frame of discernment for subintervals [a, b] of [0, 1]. Middle: the belief value for a subinterval [a, b] is the integral of the mass distribution over the highlighted area. Right: support of the corresponding plausibility value [19].

When constructing belief functions from statistical data, a possible approach, and the one we adopt here, is the likelihood-based method developed by Wasserman and Shafer [94]. This approach builds belief functions based on the likelihood function, which measures how well certain parameter values explain the observed data. Unlike Dempster’s method, the likelihood-based approach adheres to three critical principles: (i) the Likelihood Principle, which states that belief functions should be directly determined by the likelihood function; (ii) compatibility with Bayesian inference, which ensures that combining a Bayesian prior with the belief function yields the Bayesian posterior; and (iii) the Principle of Minimum Commitment, which maintains that, among the belief functions satisfying the previous two principles, the one chosen should commit to the least amount of information necessary. More in detail, in our framework likelihood-based belief function inference is applied to (a selection of) intervals of network parameter values, allowing us to compute a belief function on the parameter space based on likelihood functions derived from the Bayesian posterior.
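This likelihood-based construction can be sketched on a one-dimensional grid, with the plausibility of an interval taken as the supremum of the normalised posterior over it, and belief as one minus the plausibility of the complement (Eqs. (11)-(12) below); the Gaussian-shaped posterior and the grid are illustrative assumptions:

```python
import numpy as np

# Hypothetical normalised posterior over a 1-D parameter grid: p_hat is the
# posterior density rescaled so that its supremum equals 1 (a contour function).
omega = np.linspace(0.0, 1.0, 1001)
density = np.exp(-0.5 * ((omega - 0.4) / 0.1) ** 2)
p_hat = density / density.max()

def plausibility(a, b):
    """Pl([a, b]): the supremum of p_hat over the interval (cf. Eq. (11))."""
    mask = (omega >= a) & (omega <= b)
    return p_hat[mask].max() if mask.any() else 0.0

def belief(a, b):
    """Bel([a, b]) = 1 - Pl of the complement of [a, b] (cf. Eq. (12))."""
    mask = (omega < a) | (omega > b)
    return 1.0 - (p_hat[mask].max() if mask.any() else 0.0)

pl = plausibility(0.3, 0.5)   # the interval contains the mode, so Pl = 1
bl = belief(0.3, 0.5)         # strictly smaller: some mass lies just outside
```

As expected from the theory, the belief of an interval never exceeds its plausibility, and both reduce to the same value only when the posterior is entirely concentrated inside the interval.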
To calculate belief values, for any subset A of the parameter space Θ = [0, 1] we compute its plausibility by taking the supremum of the normalised posterior p̂(ω|D) across all ω ∈ A, namely:

\[
Pl_\Theta(A|D) = \sup_{\omega \in A} \hat{p}(\omega|D). \tag{11}
\]

The corresponding belief function is then calculated as the complement of the plausibility of the complement \(A^c\) of A:

\[
Bel_\Theta(A|D) = 1 - Pl_\Theta(A^c|D), \tag{12}
\]

ultimately providing a random-set representation in the parameter space. This method allows us to represent uncertainty as a range of possible values, rather than as a singular, definitive outcome, which better captures the epistemic uncertainty in model predictions. Note that, in practice, sample belief values are computed only on a grid of parameter values, as network parameter spaces are very high-dimensional. Once belief values are computed over a grid of closed intervals, mass values can easily be obtained there by Moebius inversion [76]. The resulting mass function representation serves as the basis for fitting a Dirichlet distribution, to achieve a continuous pdf representation of the mass function over intervals of parameters.

5.3 Fitting a Dirichlet Distribution

In the fourth step of the random-set wrapper, we employ Maximum Likelihood Estimation (MLE) to fit a Dirichlet distribution to the resulting mass values.

Dirichlet distributions The Dirichlet distribution is a family of continuous multivariate probability distributions parameterised by a vector α of positive real numbers, in fact a multivariate extension of the beta distribution (Fig. 8):

\[
f(x_1, \ldots, x_K; \alpha_1, \ldots, \alpha_K) = \frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}. \tag{13}
\]

Dirichlet distributions are often employed as priors in Bayesian statistics. They are defined on the collection of vectors x ∈ [0, 1]^K of dimension K whose coordinates add up to 1, and can thus be interpreted as second-order distributions.

Fig. 8. Probability densities of the Dirichlet distribution (13) as functions on the 2-simplex: α = (6, 2, 6) (left), α = (2, 3, 4) (right).

MLE procedure.
As described by T. Minka [58], MLE for a Dirichlet distribution works by iteratively estimating the parameters α of the distribution. These parameters define the shape of the Dirichlet distribution, which models probabilities that sum to one. In the MLE process, the goal is to maximize the likelihood function, which, in this case, measures how well the Dirichlet distribution fits the observed mass values. The optimization procedure adjusts the parameters α to maximise this likelihood, ensuring that the Dirichlet distribution effectively captures the underlying distribution of the mass values. Minka’s method also uses Newton’s method to efficiently update the parameters during the optimization, making the fitting process computationally feasible even for high-dimensional data. This approach is particularly suitable for our framework because it transforms discrete mass values into a continuous (Dirichlet) probabilistic model, allowing us to generate belief posteriors that provide a comprehensive and flexible representation of uncertainty.

5.4 Predictions Using Interval Neural Networks

To generate the final predictions, we propose to utilise Interval Neural Networks (INNs) [65]. These networks handle interval-based representations of uncertainty in both inputs and weights, making them well-suited for capturing and propagating uncertainty throughout the network. Given a belief posterior in the parameter space, calculated as in the previous steps, one can sample one or more intervals of parameters from it and feed them to an INN to generate one or more predictions, analogously to what is done in Bayesian deep learning. Traditional INNs operate with deterministic interval-based inputs, outputs, and parameters such as weights and biases for each node. Our approach, on the other hand, effectively encodes epistemic uncertainty (represented as random sets) into interval-based predictions.
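The interval propagation performed by such networks can be sketched for a single linear layer followed by ReLU; the weight bounds and input below are hypothetical, and real INNs of course stack many such layers:

```python
import numpy as np

def interval_matvec(W_lo, W_hi, x_lo, x_hi):
    """Exact interval bounds for y = W @ x when both the weights and the
    inputs are intervals: each term's bounds come from its four corner
    products, and the bounds of a sum are the sums of the bounds."""
    n_out, n_in = W_lo.shape
    y_lo, y_hi = np.zeros(n_out), np.zeros(n_out)
    for i in range(n_out):
        for j in range(n_in):
            corners = (W_lo[i, j] * x_lo[j], W_lo[i, j] * x_hi[j],
                       W_hi[i, j] * x_lo[j], W_hi[i, j] * x_hi[j])
            y_lo[i] += min(corners)
            y_hi[i] += max(corners)
    return y_lo, y_hi

def interval_relu(y_lo, y_hi):
    """ReLU is monotone, so it can be applied to each bound independently."""
    return np.maximum(y_lo, 0.0), np.maximum(y_hi, 0.0)

# Hypothetical one-layer INN: weight intervals as might be sampled from a
# belief posterior, applied to a precise (degenerate-interval) input.
W_lo = np.array([[0.9, -0.2], [0.1, 0.4]])
W_hi = np.array([[1.1, 0.2], [0.3, 0.6]])
x_lo = x_hi = np.array([1.0, 2.0])
y_lo, y_hi = interval_relu(*interval_matvec(W_lo, W_hi, x_lo, x_hi))
# y_lo = [0.5, 0.9], y_hi = [1.5, 1.5]: the output intervals enclose the
# output of every network whose weights lie within the given bounds.
```

The resulting output intervals are what our wrapper interprets as interval-based predictions carrying epistemic uncertainty.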
We are now in the process of extensively validating this approach, and comparing it with state-of-the-art Bayesian and evidential competitors.

6 Random Sets for Detection under Uncertainty

A similar approach, leveraging a Dirichlet representation of continuous belief functions, can be adopted to devise, within the field of computer vision, object detector networks aware of uncertainty.

Object Detectors Object detection is a core computer vision task that involves identifying and localising objects within an image. The primary outputs of object detectors are class labels, bounding boxes (i.e., the four coordinates of the top left and bottom right corners of the smallest rectangular box containing the object of interest) and confidence scores. Class labels indicate what type of object is present, bounding boxes define the position and size of detected objects, and confidence scores express the likelihood of correct classification for each detection. Bounding boxes are represented using coordinates and dimensions and play a key role in accurately locating objects. Object detectors are typically divided into two main categories: two-stage detectors (e.g., Faster R-CNN [71]), which first generate region proposals before classifying them, and single-stage detectors (e.g., YOLO [70], SSD [50]), which detect objects in a single step, offering faster detection.

Fig. 9. Epistemic object detection with uncertainty quantification using Dirichlet distributions and random sets (top). The model takes an input image and, using a CNN-based object detection model (bottom), predicts class probabilities, with uncertainty modelled by random sets over intervals of bounding box coordinates (right).

During training, these models use a combination of loss functions. For localisation (bounding box regression), Mean Squared Error (MSE)/Mean Absolute Error (MAE) or IoU-based losses (e.g., GIoU, DIoU) are commonly used to measure the difference between predicted and ground-truth box coordinates.
For classification, a cross-entropy loss is typically applied to predict the correct object class within each bounding box. For objectness (distinguishing objects from the background), a binary cross-entropy loss is used to determine whether a proposed region contains an object or not, ensuring accurate proposal generation.

6.1 Epistemic Object Detection

Figure 9 illustrates our epistemic object detection concept. In this framework, bounding boxes are first transformed into Borel intervals, which are subsequently converted into a Dirichlet distribution. The model is trained to predict the alpha parameters of this Dirichlet distribution, reflecting the uncertainty associated with bounding box predictions. The model learns these parameters by using as a loss function the Kullback-Leibler (KL) divergence between the predicted Dirichlet mass function and the ground-truth Dirichlet encoding of the true bounding box coordinates, optimizing the alignment between predicted and ground-truth distributions. As solving the detection problem requires outputting both a bounding box around the object of interest and its predicted class, object classes also need to be represented using the random-set framework. This can be done via the Random-Set Convolutional Neural Network (RS-CNN) framework introduced in Section 4. By integrating these two mechanisms, we obtain an epistemic object detection model capable of quantifying epistemic uncertainty in both the bounding box locations and class labels, leading to more robust and interpretable object detection predictions. We are currently in the process of validating the concept on standard object detection datasets, as well as devising a suitable way of evaluating uncertainty-aware detections and further extending the loss function for robustness.
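The KL divergence between two Dirichlet distributions has a closed form, which makes a loss of the kind described above easy to sketch; the alpha vectors below are made-up examples, and the function is a generic Dirichlet KL implementation rather than our exact training loss:

```python
import numpy as np
from scipy.special import digamma, gammaln

def kl_dirichlet(alpha, beta):
    """Closed-form KL( Dir(alpha) || Dir(beta) ):
    log G(a0) - sum_i log G(a_i) - log G(b0) + sum_i log G(b_i)
      + sum_i (a_i - b_i) (psi(a_i) - psi(a0)),
    where a0 = sum_i a_i, b0 = sum_i b_i, G is the gamma function and
    psi the digamma function."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    a0, b0 = alpha.sum(), beta.sum()
    return (gammaln(a0) - gammaln(alpha).sum()
            - gammaln(b0) + gammaln(beta).sum()
            + np.dot(alpha - beta, digamma(alpha) - digamma(a0)))

# Hypothetical Dirichlet parameters for one bounding box: a predicted
# encoding versus the ground-truth encoding of the true coordinates.
alpha_pred = np.array([4.0, 2.0, 6.0, 3.0])
alpha_true = np.array([5.0, 2.0, 5.0, 3.0])
loss = kl_dirichlet(alpha_pred, alpha_true)   # non-negative; 0 iff identical
```

Minimising this quantity pulls the predicted alpha parameters towards the ground-truth encoding while remaining sensitive to the concentration (and hence the uncertainty) of the prediction.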
7 Random Sets in Large Language Models

Large Language Models (LLMs) [1], [2], [86] have made significant waves in recent years and have been shown to perform very well on a wide range of NLP tasks such as question-answering [97], common-sense reasoning [95], mathematical problem-solving [49] and code generation [73]. Typically, these models are pre-trained on a large corpus of text to predict the next token in an unsupervised fashion. To make these models usable, they are further fine-tuned for a particular application [15] and aligned to make sure the model behaves in an ethical manner and according to human preferences. LLMs, however, still suffer from limitations in their capacity to understand information, and often produce false statements or hallucinations [57]. This makes them less trustworthy and hinders the deployment of LLMs in high-stakes decision-making applications, where the consequences of incorrect decisions are severe. Therefore, there needs to be a mechanism to make LLMs more truthful (calibration) and to associate an uncertainty estimate with the model’s generation. Furthermore, they need to distinguish the source of this uncertainty, as it can be aleatoric (relating to chance) or epistemic (relating to knowledge). Unfortunately, LLMs, in their current form, lack the capacity to do so.

7.1 Related Work

Nevertheless, significant work has been done in this regard. Laplace-LoRA [100] and BLoB [93] employ Bayesian methods over LoRA [40]. Similarly, methods like Monte Carlo Dropout (MCD) [36], Bayes by Backprop (BBB) [10] and Deep Ensembles [7] have also been used over LLMs to quantify uncertainty. ENN-LLM [66] measures uncertainty by using its ensemble-inspired Epinet with LLMs. One stream of methods [14] uses the hidden-state information of models to quantify uncertainty, while others [68] simply leverage the softmax entropy of the model output.
However, almost all of these methods suffer either in inference time (due to the need to run the model multiple times) or in terms of performance.

7.2 Rationale

Motivated by the success of RS-CNN in classification (Sec. 4), we propose instead the concept of Random-Set Large Language Models (RS-LLMs). Instead of predicting a probability distribution over the vocabulary for the next token, an RS-LLM predicts a belief function over the vocabulary. As described in Sec. 4, this comes with all the benefits of using a belief function as output, and allows us to measure the uncertainty associated with the prediction both in terms of credal width and pignistic entropy (as specified in Sec. 4.4), while also giving the model the capability to convey its lack of knowledge.

7.3 Methodology

Random sets are defined over the power set of the domain, but enumerating the power set is computationally infeasible for a large number of classes, especially for large language models, where the typical vocabulary size is ∼32K tokens. Therefore, to tackle this problem, we need a method to select a budget of the most appropriate focal sets and use only those as the support of our random sets. Inspired by RS-NN, we propose a strategy based on clustering the most similar tokens. This is appropriate because, intuitively, the model will most likely be confused among tokens which are similar to each other. The embeddings of the tokens are computed using a pre-trained LLM, and hierarchical clustering is then applied to cluster similar tokens. Fig. 10 shows a detailed overview of the proposed budgeting method.

Fig. 10. Proposed budgeting method for RS-LLM. First, embeddings are computed for all the tokens in the vocabulary.
Then, t-SNE is applied for dimensionality reduction, and focal sets are computed using hierarchical clustering.

The architecture of RS-LLM is based on the underlying LLM: only the last layer needs to be replaced and trained again. Given a token, its belief encoding is a vector with 1 for all the sets containing that token and 0 everywhere else. The loss function introduced in Sec. 4 for RS-CNN is used as the loss function here as well. At generation time, the model predicts a belief function. The next token is obtained by performing two-level sampling: first sampling a probability distribution from the belief function, then sampling a token from that probability distribution. The uncertainty estimate of a generated sentence is the mean uncertainty (credal width or entropy) associated with each token in that sentence. Fig. 11 shows the training and generation flow of RS-LLM. We are currently in the process of implementing this principle on a Llama2 model6, and validating our hypothesis that modelling epistemic uncertainty in language models using random sets can significantly mitigate hallucination issues in natural language processing.

Fig. 11. Training and generation flow of RS-LLM. Training is performed in a parallel fashion using the ‘teacher forcing’ method [37]. Generation is done sequentially. For each token, the model predicts a belief function; the mass function, probability distribution and next token are then subsequently computed/sampled from that belief function.

8 Conclusions and Future Work

In this paper, we outlined our Epistemic AI research programme, aiming to revisit the foundations of machine learning by effectively modelling epistemic uncertainty.
In particular, we focused on how random-set representations can be employed to model epistemic uncertainty in both the target and the parameter space of a neural network, in an epistemic deep learning approach. We illustrated a Random-Set CNN approach for predicting belief functions for classification and a ‘wrapper’ method for transforming Bayesian posteriors into continuous belief functions in the parameter space, outlined a methodology for constructing epistemic object detectors for robust computer vision, and introduced a random-set setting for Large Language Models with the potential of addressing outstanding issues in natural language processing. Our research is ongoing [11, 92]. As natural next steps, we will look at formalizing a fully-fledged generalization of Bayesian deep learning leveraging the Generalized Bayes Theorem, and at devising an epistemic approach to reinforcement learning through a generalization of classical Markov decision processes using random sets. In terms of applications, we will target the modelling of epistemic uncertainty in diffusion models within generative AI, as well as in graph neural networks and transformer models. An outline of a possible future research programme for random set theory is discussed in [20].

6 https://www.llama2.ai/

References

1. J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
2. R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
3. A. Antonucci and G. Corani. The multilabel Naive Credal Classifier. International Journal of Approximate Reasoning, 83:320–336, 2017.
4. A. Antonucci and F. Cuzzolin. Credal sets approximation by lower probabilities: Application to credal networks. In E. Hüllermeier, R. Kruse, and F.
Hoffmann, editors, Computational Intelligence for Knowledge-Based Systems Design, volume 6178 of Lecture Notes in Computer Science, pages 716–725. Springer, Berlin Heidelberg, 2010. 5. K. Asadi and M. L. Littman. An alternative softmax operator for reinforcement learning. In International Conference on Machine Learning, pages 243–252. PMLR, 2017. 6. R. J. Aumann. Common priors: A reply to gul. Econometrica, 66(4):929–938, 1998. 7. O. Balabanov and H. Linander. Uncertainty quantification in fine-tuned llms using lora ensembles. arXiv preprint arXiv:2402.12264, 2024. 8. A. Bella, C. Ferri, J. Hernández-Orallo, and M. J. Ramı́rez-Quintana. Calibration of machine learning models. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, pages 128–146. IGI Global, 2010. 9. V. Bengs, E. Hüllermeier, and W. Waegeman. Pitfalls of epistemic uncertainty quantification through loss minimisation. Advances in Neural Information Processing Systems, 35:29205–29216, 2022. 10. C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural network. In International conference on machine learning, pages 1613–1622. PMLR, 2015. 11. M. Caprio, M. Sultana, E. Elia, and F. Cuzzolin. Credal learning theory. In Proceedings of NeurIPS, 2024. 12. B. Charpentier, D. Zügner, and S. Günnemann. Posterior network: Uncertainty estimation without ood samples via density-based pseudo-counts. Advances in neural information processing systems, 33:1356–1367, 2020. 13. A. Chateauneuf and J.-Y. Jaffray. Some characterizations of lower probabilities and other monotone capacities through the use of Möbius inversion. Mathematical Social Sciences, 17(3):263–283, 1989. 14. C. Chen, K. Liu, Z. Chen, Y. Gu, Y. Wu, M. Tao, Z. Fu, and J. Ye. Inside: Llms’ internal states retain the power of hallucination detection. arXiv preprint arXiv:2402.03744, 2024. 15. K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. 
Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. 16. F. Cuzzolin. On the structure of simplex of inner bayesian approximations. In Proceedings of the European Conference on Logics in Artificial Intelligence (JELIA’08), Dresden, Germany. Citeseer, 2008. 17. F. Cuzzolin. Credal semantics of Bayesian transformations in terms of probability intervals. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 40(2):421–432, 2010. 18. F. Cuzzolin. Lp consonant approximations of belief functions. IEEE Transactions on Fuzzy Systems, 22(2):420–436, 2013. 19. F. Cuzzolin. The geometry of uncertainty - The geometry of imprecise probabilities. Springer Nature, 2021. 20. F. Cuzzolin. Reasoning with random sets: An agenda for the future. arXiv preprint arXiv:2401.09435, 2023. 21. F. Cuzzolin. Uncertainty measures: A critical survey. Information Fusion, page 102609, 2024. 22. F. Cuzzolin. Visions of a generalized probability theory. Lambert Academic Publishing, September 2014. 23. F. Cuzzolin and R. Frezza. Geometric analysis of belief space and conditional subspaces. In ISIPTA, pages 122–132, 2001. 24. F. Cuzzolin and M. Sultana. Epistemic Uncertainty in Artificial Intelligence. Springer, 2024. 25. A. P. Dempster. New methods for reasoning towards posterior distributions based on sample data. The Annals of Mathematical Statistics, 37(2):355–374, 1966. 26. A. P. Dempster. Upper and lower probabilities induced by a multivalued mapping. Annals of Mathematical Statistics, 38(2):325–339, 1967. 27. A. P. Dempster. Upper and lower probability inferences based on a sample from a finite univariate population. Biometrika, 54(3-4):515–528, 1967. 28. D. Denneberg and M. Grabisch. Interaction transform of set functions over a finite set. Information Sciences, 121(1-2):149–170, 1999. 29. T. Denœux. A k-nearest neighbor classification rule based on Dempster–Shafer theory. In R. R. Yager and L. 
Liu, editors, Classic Works of the Dempster-Shafer Theory of Belief Functions, volume 219 of Studies in Fuzziness and Soft Computing, pages 737–760. Springer, 2008. 30. T. Denoeux. Decision-making with belief functions: A review. International Journal of Approximate Reasoning, 109:87–110, 2019. 31. T. Denoeux. Nn-evclus: Neural network-based evidential clustering. Information Sciences, 572:297–330, 2021. 32. D. Dubois and H. Prade. A set-theoretic view of belief functions Logical operations and approximations by fuzzy sets. International Journal of General Systems, 12(3):193–226, 1986. 33. Z. Elouedi, K. Mellouli, and P. Smets. Decision trees using the belief function theory. In Proceedings of the Eighth International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU 2000), volume 1, pages 141– 148, Madrid, 2000. 34. I. Epifanio and G. Ayala. A random set view of texture classification. IEEE Transactions on Image Processing, 11(8):859–867, 2002. 35. Z. ga Liu, J. Dezert, G. Mercier, and Q. Pan. Belief C-means: An extension of fuzzy C-means algorithm in belief functions framework. Pattern Recognition Letters, 33(3):291– 300, 2012. 36. Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059. PMLR, 2016. 37. S. Goodman, N. Ding, and R. Soricut. Teaforn: Teacher-forcing with n-grams. arXiv preprint arXiv:2010.03494, 2020. 38. J. Goutsias, R. P. S. Mahler, and H. T. Nguyen. Random sets: theory and applications, volume 97 of IMA Volumes in Mathematics and Its Applications. Springer-Verlag, December 1997. 39. M. Hobbhahn, A. Kristiadi, and P. Hennig. Fast predictive uncertainty for classification with Bayesian deep networks. In Uncertainty in Artificial Intelligence, pages 822–832. PMLR, 2022. 40. E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. 
LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 41. E. Hüllermeier and W. Waegeman. Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine Learning, 110(3):457–506, 2021. 42. L. V. Jospin, H. Laga, F. Boussaid, W. Buntine, and M. Bennamoun. Hands-on Bayesian neural networks: A tutorial for deep learning users. IEEE Computational Intelligence Magazine, 17(2):29–48, 2022. 43. A. Kendall and Y. Gal. What uncertainties do we need in Bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977, 2017. 44. S. Khan, I. Teeti, A. Bradley, M. Elhoseiny, and F. Cuzzolin. A hybrid graph network for complex activity detection in video. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6762–6772, 2024. 45. A.-K. Kopetzki, B. Charpentier, D. Zügner, S. Giri, and S. Günnemann. Evaluating robustness of predictive uncertainty estimation: Are Dirichlet-based models reliable? In International Conference on Machine Learning, pages 5707–5718. PMLR, 2021. 46. S. B. Kotsiantis, I. Zaharakis, P. Pintelas, et al. Supervised machine learning: A review of classification techniques. Emerging Artificial Intelligence Applications in Computer Engineering, 160(1):3–24, 2007. 47. Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015. 48. I. Levi. The enterprise of knowledge: An essay on knowledge, credal probability, and chance. The MIT Press, Cambridge, Massachusetts, 1980. 49. A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022. 50. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector.
In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I, pages 21–37. Springer, 2016. 51. A. Malinin and M. Gales. Predictive uncertainty estimation via prior networks. Advances in Neural Information Processing Systems, 31, 2018. 52. A. Malinin and M. Gales. Reverse KL-divergence training of prior networks: Improved uncertainty and adversarial robustness. Advances in Neural Information Processing Systems, 32, 2019. 53. A. Malinin, B. Mlodozeniec, and M. Gales. Ensemble distribution distillation. In International Conference on Learning Representations, 2019. 54. S. K. Manchingal and F. Cuzzolin. Epistemic deep learning. arXiv preprint arXiv:2206.07609, 2022. 55. S. K. Manchingal, M. Mubashar, K. Wang, K. Shariatmadar, and F. Cuzzolin. Random-set convolutional neural network (RS-CNN) for epistemic deep learning. arXiv preprint arXiv:2307.05772, 2023. 56. D. Maulud and A. M. Abdulazeez. A review on linear regression comprehensive in machine learning. Journal of Applied Science and Technology Trends, 1(2):140–147, 2020. 57. J. Maynez, S. Narayan, B. Bohnet, and R. McDonald. On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661, 2020. 58. T. Minka. Estimating a Dirichlet distribution, 2000. 59. I. S. Molchanov. Theory of random sets, volume 19. Springer, 2005. 60. T. Mortier, M. Wydmuch, K. Dembczyński, E. Hüllermeier, and W. Waegeman. Efficient set-valued prediction in multi-class classification. Data Mining and Knowledge Discovery, 35(4):1435–1469, 2021. 61. H. Nguyen and T. Wang. Belief functions and random sets, pages 243–255. Springer, 1997. 62. H. T. Nguyen. On random sets and belief functions. Journal of Mathematical Analysis and Applications, 65:531–542, 1978. 63. H. T. Nguyen. An introduction to random sets. Taylor and Francis, 2006. 64. V.-L. Nguyen, S. Destercke, M.-H. Masson, and E. Hüllermeier.
Reliable multi-class classification based on pairwise epistemic and aleatoric uncertainty. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 5089–5095. International Joint Conferences on Artificial Intelligence Organization, 2018. 65. L. Oala, C. Heiß, J. Macdonald, M. März, G. Kutyniok, and W. Samek. Detecting failure modes in image reconstructions with interval neural network uncertainty. International Journal of Computer Assisted Radiology and Surgery, 16:2089–2097, 2021. 66. I. Osband, S. M. Asghari, B. Van Roy, N. McAleese, J. Aslanides, and G. Irving. Fine-tuning language models via epistemic neural networks. arXiv preprint arXiv:2211.01568, 2022. 67. I. Osband, Z. Wen, S. M. Asghari, V. Dwaracherla, M. Ibrahimi, X. Lu, and B. Van Roy. Epistemic neural networks. Advances in Neural Information Processing Systems, 36, 2024. 68. B. Plaut, K. Nguyen, and T. Trinh. Softmax probabilities (mostly) predict large language model correctness on multiple-choice Q&A. arXiv preprint arXiv:2402.13213, 2024. 69. M.-C. Popescu, V. E. Balas, L. Perescu-Popescu, and N. Mastorakis. Multilayer perceptron and neural networks. WSEAS Transactions on Circuits and Systems, 8(7):579–588, 2009. 70. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016. 71. S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2016. 72. G. Rogova. Combining the results of several neural network classifiers. Classic Works of the Dempster-Shafer Theory of Belief Functions, pages 683–692, 2008. 73. B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023. 74. T. G.
Rudner, Z. Chen, Y. W. Teh, and Y. Gal. Tractable function-space variational inference in Bayesian neural networks. Advances in Neural Information Processing Systems, 35:22686–22698, 2022. 75. M. Sensoy, L. Kaplan, and M. Kandemir. Evidential deep learning to quantify classification uncertainty. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 3183–3193, Red Hook, NY, USA, 2018. Curran Associates Inc. 76. G. Shafer. A mathematical theory of evidence, volume 42. Princeton University Press, 1976. 77. P. Smets. The transferable belief model and other interpretations of Dempster–Shafer's model. In P. P. Bonissone, M. Henrion, L. N. Kanal, and J. F. Lemmer, editors, Uncertainty in Artificial Intelligence, volume 6, pages 375–383. North-Holland, Amsterdam, 1991. 78. P. Smets. The transferable belief model and random sets. International Journal of Intelligent Systems, 7(1):37–46, 1992. 79. P. Smets. Belief functions on real numbers. International Journal of Approximate Reasoning, 40(3):181–223, 2005. 80. P. Smets et al. Constructing the pignistic probability function in a context of uncertainty. In UAI, volume 89, pages 29–40, 1989. 81. D. Soydaner. Attention mechanism in neural networks: where it comes and where it goes. Neural Computing and Applications, 34(16):13371–13385, 2022. 82. V. Spruyt. How to draw a covariance error ellipse. Computer Vision for Dummies, 2014. Available online: http://www.visiondummy.com/2014/04/draw-error-ellipse-representing-covariance-matrix/. 83. M. Stadler, B. Charpentier, S. Geisler, D. Zügner, and S. Günnemann. Graph posterior network: Bayesian predictive uncertainty for node classification. Advances in Neural Information Processing Systems, 34:18033–18048, 2021. 84. V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 105(12):2295–2329, 2017. 85. Z. Tong, P. Xu, and T. Denoeux.
An evidential classifier based on Dempster-Shafer theory and deep learning. Neurocomputing, 450:275–293, 2021. 86. H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 87. G. Tsoumakas, I. Katakis, and I. Vlahavas. Random k-labelsets for multilabel classification. IEEE Transactions on Knowledge and Data Engineering, 23(7):1079–1089, 2010. 88. D. T. Ulmer, C. Hardmeier, and J. Frellsen. Prior and posterior networks: A survey on evidential deep learning methods for uncertainty estimation. Transactions on Machine Learning Research, 2023. 89. L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605, 2008. 90. A. Wallner. Maximal number of vertices of polytopes defined by f-probabilities. In F. G. Cozman, R. Nau, and T. Seidenfeld, editors, Proceedings of the Fourth International Symposium on Imprecise Probabilities and Their Applications (ISIPTA 2005), pages 388–395, 2005. 91. H. Wang and D.-Y. Yeung. A survey on Bayesian deep learning. ACM Computing Surveys (CSUR), 53(5):1–37, 2020. 92. K. Wang, F. Cuzzolin, K. Shariatmadar, S. Manchingal, D. Moens, and H. Hallez. Credal deep ensembles for uncertainty quantification. In Proceedings of NeurIPS, 2024. 93. Y. Wang, H. Shi, L. Han, D. Metaxas, and H. Wang. BLoB: Bayesian low-rank adaptation by backpropagation for large language models. arXiv preprint arXiv:2406.11675, 2024. 94. L. A. Wasserman. Belief functions and statistical inference. Canadian Journal of Statistics, 18(3):183–196, 1990. 95. J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022. 96. A. G. Wilson. The case for Bayesian deep learning.
arXiv preprint arXiv:2001.10995, 2020. 97. T. Wu, S. He, J. Liu, S. Sun, K. Liu, Q.-L. Han, and Y. Tang. A brief overview of ChatGPT: The history, status quo and potential future development. IEEE/CAA Journal of Automatica Sinica, 10(5):1122–1136, 2023. 98. B. J. Wythoff. Backpropagation neural networks: A tutorial. Chemometrics and Intelligent Laboratory Systems, 18(2):115–155, 1993. 99. L. Xu, A. Krzyzak, and C. Suen. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems, Man, and Cybernetics, 22(3):418–435, 1992. 100. A. X. Yang, M. Robeyns, X. Wang, and L. Aitchison. Bayesian low-rank adaptation for large language models. arXiv preprint arXiv:2308.13111, 2023. 101. B. Yegnanarayana. Artificial neural networks. PHI Learning Pvt. Ltd., 2009. 102. M. Zaffalon. The naive credal classifier. Journal of Statistical Planning and Inference, 105(1):5–21, 2002. 103. M. Zaffalon and E. Fagiuoli. Tree-based credal networks for classification. Reliable Computing, 9(6):487–509, 2003.