Active Learning in Large Scale Vector Optimization Problems


Program: Scientific and Technological Research Council of Turkey (TÜBİTAK) 1001 Program

Principal Investigator: Cem Tekin

Researcher: Çağın Ararat

Graduate Student Contributors: Artun Saday, İlter Onat Korkmaz, Yaşar Cahit Yıldırım

Undergraduate Student Contributors: Efe Mert Karagözlü


Project Summary:

Many complex scientific problems, from vaccine development for pandemic diseases to deep neural architecture search, require identifying, via sequential experimentation, designs that satisfy certain performance criteria from a set containing many alternative designs. This problem has mainly been investigated in the literature when there is a single performance objective to be optimized, under names such as active learning, Bayesian experimental design, and Bayesian optimization. Due to their complex nature, however, many real-world applications of active learning require the optimization of multiple objectives, and a design that simultaneously maximizes every objective may not exist. Therefore, for experts to decide which designs to focus their research on, it is necessary to identify the Pareto optimal designs, i.e., those that are not dominated by any other design.

It is very difficult for domain experts to design optimal sequential experiments for problems that involve large design sets, multiple performance objectives intricately tied to each other, and restrictions on observations of design performances, such as partial and noisy observations. The goal of this project is to design mathematically rigorous, provably efficient, and scalable active learning algorithms for complex learning problems in which designs that optimize conflicting objectives must be identified by sequential experimentation using the minimum amount of resources, given that the performances of the designs under the different objectives are initially unknown. The algorithms developed in this project will enable experts in disciplines ranging from biology to data science to identify optimal designs quickly, with a small number of evaluations and a minimal amount of resources. The results of this project will pave the way toward efficient and fast solutions for scientific problems in many fields that currently require numerous trials and substantial resources.


VectOptAL: A Library for Vector Optimization with Active Learning


We present the VectOptAL library, which aims to provide a platform for the rapid development of vector optimization algorithms and to streamline researchers' algorithm design process. VectOptAL is publicly available on our GitHub repository.

Vector Optimization with Stochastic Bandit Feedback


Summary

  • We consider Pareto set identification in vector optimization problems with \(K\) designs, each with a \(D\)-dimensional mean vector. \[\begin{aligned} \text{maximize} ~ \underbrace{\boldsymbol{\mu}_i}_{\text{unknown}} ~ \text{w.r.t. cone} ~C~ \text{over} ~i \in [K] \end{aligned}\]

  • This generalizes Pareto set identification in multi-objective optimization problems with unknown mean vectors. \[\begin{aligned} \text{maximize} ~\underbrace{\boldsymbol{\mu}_i}_{\text{unknown}}~ \text{w.r.t. cone} ~\underbrace{\mathbb{R}^D_{+}}_{\text{componentwise order}}~ \text{over} ~i \in [K] \end{aligned}\]

  • Cone \(C \subset \mathbb{R}^D\) encodes preferences over designs, as illustrated in the accompanying figure.

    • Left panel: three cones in \(\mathbb{R}^2\) represented by their angular widths: \(C_{\pi/4}\) (orange), \(C_{\pi/2}\) (blue), \(C_{3\pi/4}\) (green). Note: \(C_{\pi/2} = \mathbb{R}^2_{+}\).

    • Right panel: Pareto sets under these cones on an example design set: \(P^*_{\pi/4}\) (orange), \(P^*_{\pi/2}\) (blue), \(P^*_{3\pi/4}\) (green). Note: \(P^*_{3\pi/4} \subseteq P^*_{\pi/2} \subseteq P^*_{\pi/4}\) (see the code sketch after this list).

  • We address the following fundamental question: What is the sample complexity of Pareto set identification and how does it depend on \(C\)?

  • Our results:

    • Introduced a cone-dependent ordering-complexity term, \(\beta = \max \{\beta_1, \beta_2 \}\), to characterize the learning difficulty of vector optimization problems.

    • \(\Omega \left( K \frac{\beta_2^2}{\epsilon^2} \log \left(\frac{1}{\delta} \right) \right)\) worst-case lower bound on sample complexity of \((\epsilon, \delta)\)-PAC algorithms.

    • Introduced cone-dependent suboptimality gaps.

    • \(\Omega \left( \sum_{i \in [K]} \frac{1}{ (\tilde{\Delta}^\epsilon_i)^2 } \log \left(\frac{1}{\delta} \right) \right)\) gap-dependent lower bound on sample complexity of \((\epsilon, \delta)\)-PAC algorithms.

    • Proposed an \((\epsilon, \delta)\)-PAC algorithm with \(O \left( K \frac{\beta^2}{\epsilon^2} \log \left(\frac{1}{\delta} \right) \right)\) sample complexity.
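
To make the cone-dependent Pareto sets concrete, below is a minimal NumPy sketch (illustrative only, not part of VectOptAL; the helper names cone_halfspaces and pareto_set are ours, and each cone is taken symmetric about the 45-degree direction so that the \(\pi/2\) cone equals \(\mathbb{R}^2_+\)). It builds the three cones \(C_{\pi/4}\), \(C_{\pi/2}\), \(C_{3\pi/4}\) as halfspace matrices \(W\), computes the Pareto set of a random design set under each, and checks the nesting \(P^*_{3\pi/4} \subseteq P^*_{\pi/2} \subseteq P^*_{\pi/4}\).

    import numpy as np

    def cone_halfspaces(theta):
        """Halfspace matrix W of the 2D cone of angular width theta, symmetric about
        the 45-degree direction, so that C = {x : W x >= 0}.
        For theta = pi/2 this recovers the componentwise cone R^2_+."""
        lo, hi = np.pi / 4 - theta / 2, np.pi / 4 + theta / 2  # angles of the extreme rays
        w1 = np.array([-np.sin(lo), np.cos(lo)])   # inward normal of the lower ray
        w2 = np.array([np.sin(hi), -np.cos(hi)])   # inward normal of the upper ray
        return np.vstack([w1, w2])

    def pareto_set(means, W, tol=1e-9):
        """Designs i such that no j dominates i, i.e., no j with mu_j - mu_i in C and nonzero."""
        K = len(means)
        optimal = set()
        for i in range(K):
            dominated = any(
                np.linalg.norm(means[j] - means[i]) > tol
                and np.all(W @ (means[j] - means[i]) >= -tol)
                for j in range(K))
            if not dominated:
                optimal.add(i)
        return optimal

    rng = np.random.default_rng(0)
    means = rng.normal(size=(20, 2))  # K = 20 designs with D = 2 dimensional means
    P = {theta: pareto_set(means, cone_halfspaces(theta))
         for theta in (np.pi / 4, np.pi / 2, 3 * np.pi / 4)}
    # Wider cones induce more dominance relations, hence smaller Pareto sets.
    assert P[3 * np.pi / 4] <= P[np.pi / 2] <= P[np.pi / 4]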


\((\epsilon, \delta)\)-PAC Pareto Set Identification Problem

Definition 1. (Polyhedral ordering cone) A cone \(C\) is called polyhedral if it can be written as \(C=\{\boldsymbol{x}\in \mathbb{R}^D\mid W\boldsymbol{x}\geq 0\}\) for some \(N\times D\) real matrix \(W\) with rows \(\boldsymbol{w}^{\mathsf{T}}_1, \ldots, \boldsymbol{w}^{\mathsf{T}}_N\), and positive integer \(N\).
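
For instance, the componentwise cone is polyhedral with \(N = D\) and \(W = I_D\), the \(D \times D\) identity matrix: \(\mathbb{R}^D_+ = \{\boldsymbol{x}\in \mathbb{R}^D \mid I_D\, \boldsymbol{x} \geq 0\}\).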

Ordering complexity of \(C\): \(\beta := \max\{ \beta_1, \beta_2 \}\), where \[\begin{aligned} ~ \beta_1:=\sup_{\boldsymbol{x}\notin C}\frac{d(\boldsymbol{x},C\cap(\boldsymbol{x}+C))}{d(\boldsymbol{x},C)}, \quad \beta_2:=\sup_{\boldsymbol{x}\in \mathop{\mathrm{int}}(C)}\frac{d(\boldsymbol{x},(\mathop{\mathrm{int}}(C))^c\cap(\boldsymbol{x}-C))}{d(\boldsymbol{x},(\mathop{\mathrm{int}}(C))^c)} ~.\end{aligned}\]

Theorem 1. (Finiteness of ordering complexity) (i) If \(C\) is a polyhedral ordering cone without redundancy and \(\| \boldsymbol{w}_n \|_2 = 1\) for all \(n \in [N]\), then \(\beta_1<+\infty\) and \(\beta_2 < +\infty\). (ii) If, in addition, \(C\supseteq \mathbb{R}^D_+\), then \(\beta_1=\beta_2=1\).
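
In particular, the multi-objective setting corresponds to \(C = \mathbb{R}^D_+\), for which Theorem 1(ii) gives \(\beta_1 = \beta_2 = 1\); in that case the cone-dependent bounds below reduce to the familiar \(1/\epsilon^2\)-type sample complexities for Pareto set identification under the componentwise order.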

Definition 2. (Exact Pareto set) A design \(i\in [K]\) is called Pareto optimal if it is not dominated by any other design with respect to the ordering induced by \(C\), i.e., there is no \(j \in [K]\) with \(\boldsymbol{\mu}_j - \boldsymbol{\mu}_i \in C \setminus \{ \boldsymbol{0} \}\). The Pareto set \(P^\ast := \{i \in [K] \mid \nexists j\in [K]\colon \boldsymbol{\mu}_j - \boldsymbol{\mu}_i \in C \setminus \{ \boldsymbol{0} \} \}\) consists of all Pareto optimal designs.

Gaps for design \(i\notin P^\ast\): \(\Delta_i^\ast:=\max_{j\in P^\ast} d(\boldsymbol{\mu}_j-\boldsymbol{\mu}_i,(\mathop{\mathrm{int}}C)^c\cap (\boldsymbol{\mu}_j-\boldsymbol{\mu}_i-C))\), \(\tilde{\Delta}_i^\epsilon=\max\{\Delta_i^\ast,\epsilon\}\).

Gaps for design \(i\in P^\ast\): \(\Delta_i^\ast:=\min_{j\in P^\ast\setminus\{i\}} d(\boldsymbol{\mu}_j-\boldsymbol{\mu}_i,C\cap(\boldsymbol{\mu}_j-\boldsymbol{\mu}_i+C))\), \(\tilde{\Delta}_i^\epsilon=\max\{\Delta_i^\ast,\epsilon\}\).

Definition 3. (\((\epsilon,\delta)\)-PAC Pareto set) Let \(\epsilon>0\), \(\delta \in (0,1)\). A set \(P\subseteq[K]\) is called an \((\epsilon,\delta)\)-PAC Pareto set if the following properties hold with probability at least \(1-\delta\): (\(i\)) \(\bigcup_{i \in P} (\boldsymbol{\mu}_i + (B(\boldsymbol{0}, \epsilon)\cap C) - C) \supseteq \bigcup_{i \in P^*} (\boldsymbol{\mu}_i - C)\), where \(B(\boldsymbol{0},\epsilon)\) denotes the ball of radius \(\epsilon\) centered at the origin; (\(ii\)) for every \(i \in P \setminus P^\ast\), it holds that \(\Delta^*_i \leq \epsilon\).

The learning problem: The mean vectors \(\boldsymbol{\mu}_i\), \(i \in [K]\), are unknown beforehand. Designs are evaluated sequentially: at round \(t\), a design \(I_t\) is selected and evaluated, which yields a random reward vector \(\boldsymbol{X}_t = \boldsymbol{\mu}_{I_t} + \boldsymbol{Y}_{I_t,t}\), where \(\boldsymbol{Y}_{i,t}\) is a norm-subgaussian noise vector with parameter \(\sigma \geq 0\).
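
As a concrete instance of this feedback model, here is a minimal simulator (a sketch with names of our own choosing, not part of VectOptAL; i.i.d. Gaussian noise is used as a simple example of norm-subgaussian noise):

    import numpy as np

    rng = np.random.default_rng(0)
    K, D, sigma = 20, 2, 0.1
    mu = rng.normal(size=(K, D))       # unknown mean reward vectors, one per design

    def evaluate(i):
        """Noisy evaluation of design i at a round t: X_t = mu_i + Y_{i,t}."""
        return mu[i] + sigma * rng.standard_normal(D)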

Given \(\epsilon>0\), \(\delta \in (0,1)\), how many evaluations are required to identify an \((\epsilon,\delta)\)-PAC Pareto set?


Sample Complexity Lower Bounds

Let \(\mathcal{A}\) be the class of all algorithms that return an \((\epsilon,\delta)\)-PAC Pareto set.

Theorem 2. (Gap-dependent lower bound) There exist norm-subgaussian noise distributions with the property that every algorithm in \(\mathcal{A}\) requires \(\Omega(\sum_{i\in [K]}\frac{1}{(\tilde{\Delta}^\epsilon_i)^2}\log(\frac{1}{\delta}))\) samples to work correctly.

Theorem 3. (Worst-case lower bound) There exist mean reward vectors and norm-subgaussian noise distributions under which any algorithm in \(\mathcal{A}\) requires \(\Omega \left( K \frac{\beta_2^2}{\epsilon^2} \log \left( \frac{1}{\delta} \right) \right)\) samples to work correctly.

  • The proof exploits the geometry of the ordering cone to construct a worst-case problem instance that is hard to distinguish from other, similar problem instances.

  • When the noise acts along a special direction determined by \(\beta_2\), a change in its distribution proportional to \(\epsilon/\beta_2\) changes the set of designs that must be returned.

  • This manifests a distinctive feature of vector optimization that is not present in the multi-objective setting: the geometry of the ordering cone characterizes the hardness of the Pareto set identification problem.


Naïve Elimination Algorithm

  • Selects each design \(L\) times.

  • Returns the Pareto set computed from the sample mean reward vectors (a sketch is given after Theorem 4 below).

  • Per-design sampling budget for \((\epsilon,\delta)\): \[\begin{aligned} g(\epsilon,\delta) := \Big\lceil \frac{4 \beta^2 c^2 \sigma^2}{\epsilon^2} \log \Big(\frac{4 D}{\delta}\Big) \Big\rceil \end{aligned}\]

Theorem 4. When naïve elimination is run with \(L = g(\epsilon, 2\delta/(K (K-1)))\), the returned Pareto set \(P\) is an \((\epsilon,\delta)\)-PAC Pareto set.

The resulting sample complexity scales as \(O\left( K \frac{\beta^2}{\epsilon^2} \log \left( \frac{1}{\delta} \right) \right)\), which nearly matches the worst-case lower bound in Theorem 3.
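
Continuing the sketches above, a minimal, illustrative implementation of naïve elimination might look as follows (not the reference implementation from the paper or from VectOptAL; evaluate is a noisy-evaluation oracle as in the simulator above, W is the halfspace matrix of the cone, and beta, c, sigma are the ordering complexity, the constant appearing in \(g(\epsilon,\delta)\), and the noise parameter, all assumed known):

    import numpy as np

    def g(eps, delta, beta, c, sigma, D):
        """Per-design sampling budget g(eps, delta) from the display above."""
        return int(np.ceil(4 * beta**2 * c**2 * sigma**2 / eps**2 * np.log(4 * D / delta)))

    def naive_elimination(evaluate, K, D, W, eps, delta, beta, c, sigma, tol=1e-9):
        """Sample every design L = g(eps, 2*delta/(K*(K-1))) times (Theorem 4) and
        return the Pareto set of the sample mean reward vectors w.r.t. C = {x : W x >= 0}."""
        L = g(eps, 2 * delta / (K * (K - 1)), beta, c, sigma, D)
        means_hat = np.array([np.mean([evaluate(i) for _ in range(L)], axis=0)
                              for i in range(K)])
        P = []
        for i in range(K):
            diffs = means_hat - means_hat[i]                    # estimated mu_j - mu_i for all j
            dominated = np.any(
                (np.linalg.norm(diffs, axis=1) > tol)           # j is a distinct design ...
                & np.all(diffs @ W.T >= -tol, axis=1))          # ... whose difference lies in C
            if not dominated:
                P.append(i)
        return P, L

For example, with the componentwise cone \(W = I_D\) and \(\beta = 1\), this reduces to standard multi-objective Pareto set identification; the total number of evaluations is \(K L\), matching the \(O(K \beta^2 \epsilon^{-2} \log(1/\delta))\) scaling above.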

Figure: success rate of naïve elimination on the SNW dataset.


Related publication:

Ç. Ararat, C. Tekin, "Vector Optimization with Stochastic Bandit Feedback", Proc. 26th International Conference on Artificial Intelligence and Statistics (AISTATS), April 2023.