Scholar's Hub

Award-Winning Papers: AI & Theory

These papers have received best paper awards or distinguished paper awards from renowned computer science conferences in the Artificial Intelligence and Theory fields.

This collection is sourced from each conference. If you notice any errors, please contact us.

AI

ACL

What the DAAM: Interpreting Stable Diffusion Using Cross Attention

  • Raphael Tang, Akshat Pandey, Zhiying Jiang, Gefei Yang, K. Kumar, Jimmy Lin, Ferhan Ture

  • Annual Meeting of the Association for Computational Linguistics

  • October 10, 2022

Diffusion models are a milestone in text-to-image generation, but they remain poorly understood, lacking interpretability analyses. In this paper, we perform a text-image attribution analysis on Stable Diffusion, a recently open-sourced model. To produce attribution maps, we upscale and aggregate cross-attention maps in the denoising module, naming our method DAAM. We validate it by testing its segmentation ability on nouns, as well as its generalized attribution quality on all parts of speech, rated by humans. On two generated datasets, we attain a competitive 58.8-64.8 mIoU on noun segmentation and fair to good mean opinion scores (3.4-4.2) on generalized attribution. Then, we apply DAAM to study the role of syntax in the pixel space across head–dependent heat map interaction patterns for ten common dependency relations. We show that, for some relations, the head map consistently subsumes the dependent, while the opposite is true for others. Finally, we study several semantic phenomena, focusing on feature entanglement; we find that the presence of cohyponyms worsens generation quality by 9%, and descriptive adjectives attend too broadly. We are the first to interpret large diffusion models from a visuolinguistic perspective, which enables future research. Our code is at https://github.com/castorini/daam.

TLDR

This work is the first to interpret large diffusion models from a visuolinguistic perspective, which enables future research, and shows that, for some relations, the head map consistently subsumes the dependent, while the opposite is true for others.
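
For a concrete picture of the attribution step, here is a minimal sketch of upscaling and aggregating cross-attention maps for one prompt token. The map layout, resolutions, and helper names are illustrative assumptions, not the authors' released code (see the linked repository for that).

```python
# A minimal DAAM-style sketch (assumed layout, not the authors' code):
# upsample per-layer cross-attention maps to a common resolution and
# average them over heads, layers, and denoising steps for one token.
import torch
import torch.nn.functional as F

def daam_heatmap(attn_maps, token_idx, out_size=64):
    """attn_maps: list of tensors (heads, H_l, W_l, n_tokens),
    one per (layer, timestep) pair -- an assumed layout."""
    acc = torch.zeros(out_size, out_size)
    for a in attn_maps:
        m = a[..., token_idx].mean(dim=0, keepdim=True)   # avg heads -> (1, H_l, W_l)
        m = F.interpolate(m.unsqueeze(0), size=(out_size, out_size),
                          mode="bicubic", align_corners=False)
        acc += m.squeeze()
    return acc / len(attn_maps)

# toy usage with random maps at two resolutions (77 prompt tokens)
maps = [torch.rand(8, 16, 16, 77), torch.rand(8, 32, 32, 77)]
print(daam_heatmap(maps, token_idx=5).shape)  # torch.Size([64, 64])
```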

Do Androids Laugh at Electric Sheep? Humor “Understanding” Benchmarks from The New Yorker Caption Contest

  • Jack Hessel, Ana Marasović, Jena D. Hwang, Lillian Lee, Jeff Da, Rowan Zellers, Robert Mankoff, Yejin Choi

  • Annual Meeting of the Association for Computational Linguistics

  • September 13, 2022

Large neural networks can now generate jokes, but do they really “understand” humor? We challenge AI models with three tasks derived from the New Yorker Cartoon Caption Contest: matching a joke to a cartoon, identifying a winning caption, and explaining why a winning caption is funny. These tasks encapsulate progressively more sophisticated aspects of “understanding” a cartoon; key elements are the complex, often surprising relationships between images and captions and the frequent inclusion of indirect and playful allusions to human experience and culture. We investigate both multimodal and language-only models: the former are challenged with the cartoon images directly, while the latter are given multifaceted descriptions of the visual scene to simulate human-level visual understanding. We find that both types of models struggle at all three tasks. For example, our best multimodal models fall 30 accuracy points behind human performance on the matching task, and, even when provided ground-truth visual scene descriptors, human-authored explanations are preferred head-to-head over the best machine-authored ones (few-shot GPT-4) in more than 2/3 of cases. We release models, code, leaderboard, and corpus, which includes newly-gathered annotations describing the image’s locations/entities, what’s unusual in the scene, and an explanation of the joke.

TLDR

This work challenges AI models with three tasks derived from the New Yorker Cartoon Caption Contest: matching a joke to a cartoon, identifying a winning caption, and explaining why a winning caption is funny.

CIKM

D-HYPR: Harnessing Neighborhood Modeling and Asymmetry Preservation for Digraph Representation Learning

  • Honglu Zhou, Advith Chegu, Samuel S. Sohn, Zuohui Fu, Gerard de Melo, M. Kapadia

  • Proceedings of the 31st ACM International Conference on Information & Knowledge Management

  • December 22, 2021

Digraph Representation Learning (DRL) aims to learn representations for directed homogeneous graphs (digraphs). Prior work in DRL is largely constrained (e.g., limited to directed acyclic graphs), or has poor generalizability across tasks (e.g., evaluated solely on one task). Most Graph Neural Networks (GNNs) exhibit poor performance on digraphs due to the neglect of modeling neighborhoods and preserving asymmetry. In this paper, we address these notable challenges by leveraging hyperbolic collaborative learning from multi-ordered and partitioned neighborhoods, and regularizers inspired by socio-psychological factors. Our resulting formalism, Digraph Hyperbolic Networks (D-HYPR) -- albeit conceptually simple -- generalizes to digraphs where cycles and non-transitive relations are common, and is applicable to multiple downstream tasks including node classification, link presence prediction, and link property prediction. In order to assess the effectiveness of D-HYPR, extensive evaluations were performed across 8 real-world digraph datasets involving 21 prior techniques. D-HYPR statistically significantly outperforms the current state of the art. We release our code at https://github.com/hongluzhou/dhypr

TLDR

The resulting formalism, Digraph Hyperbolic Networks (D-HYPR) -- albeit conceptually simple -- generalizes to digraphs where cycles and non-transitive relations are common, and is applicable to multiple downstream tasks including node classification, link presence prediction, and link property prediction.

CVPR

Planning-oriented Autonomous Driving

  • Yi Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wen Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, Hongyang Li

  • December 20, 2022

A modern autonomous driving system is characterized by modular tasks in sequential order, i.e., perception, prediction, and planning. In order to perform a wide diversity of tasks and achieve advanced-level intelligence, contemporary approaches either deploy standalone models for individual tasks, or design a multi-task paradigm with separate heads. However, they might suffer from accumulative errors or deficient task coordination. Instead, we argue that a favorable framework should be devised and optimized in pursuit of the ultimate goal, i.e., planning of the self-driving car. Oriented at this, we revisit the key components within perception and prediction, and prioritize the tasks such that all these tasks contribute to planning. We introduce Unified Autonomous Driving (UniAD), an up-to-date, comprehensive framework that incorporates full-stack driving tasks in one network. It is exquisitely devised to leverage the advantages of each module, and provide complementary feature abstractions for agent interaction from a global perspective. Tasks are communicated with unified query interfaces to facilitate each other toward planning. We instantiate UniAD on the challenging nuScenes benchmark. With extensive ablations, the effectiveness of using such a philosophy is proven by substantially outperforming the previous state of the art in all aspects. Code and models are public.

TLDR

This work introduces Unified Autonomous Driving (UniAD), an up-to-date, comprehensive framework that incorporates full-stack driving tasks in one network and is exquisitely devised to leverage the advantages of each module and provide complementary feature abstractions for agent interaction from a global perspective.

EMNLP

Faster Minimum Bayes Risk Decoding with Confidence-based Pruning

  • Julius Cheng, Andreas Vlachos

  • Conference on Empirical Methods in Natural Language Processing

  • November 25, 2023

Minimum Bayes risk (MBR) decoding outputs the hypothesis with the highest expected utility over the model distribution for some utility function. It has been shown to improve accuracy over beam search in conditional language generation problems and especially neural machine translation, in both human and automatic evaluations. However, the standard sampling-based algorithm for MBR is substantially more computationally expensive than beam search, requiring a large number of samples as well as a quadratic number of calls to the utility function, limiting its applicability. We describe an algorithm for MBR which gradually grows the number of samples used to estimate the utility while pruning hypotheses that are unlikely to have the highest utility according to confidence estimates obtained with bootstrap sampling. Our method requires fewer samples and drastically reduces the number of calls to the utility function compared to standard MBR while being statistically indistinguishable in terms of accuracy. We demonstrate the effectiveness of our approach in experiments on three language pairs, using chrF++ and COMET as utility/evaluation metrics.

TLDR

This work describes an algorithm for MBR which gradually grows the number of samples used to estimate the utility while pruning hypotheses that are unlikely to have the highest utility according to confidence estimates obtained with bootstrap sampling.
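
The sketch below illustrates the described loop of growing the sample set while pruning hypotheses via bootstrap confidence estimates. The growth schedule, the pruning threshold, and the toy word-overlap utility are illustrative assumptions, not the paper's exact configuration.

```python
# A hedged sketch of confidence-based pruning for MBR decoding.
import random

def mbr_prune(hyps, samples, utility, schedule=(8, 16, 32), n_boot=100, alpha=0.9):
    live = list(hyps)
    for k in schedule:
        sub = samples[:k]                        # grow the sample estimate
        cache = {(h, s): utility(h, s) for h in live for s in sub}
        scores = {h: sum(cache[h, s] for s in sub) / k for h in live}
        wins = {h: 0 for h in live}              # bootstrap: P(h is the argmax)
        for _ in range(n_boot):
            boot = random.choices(sub, k=k)
            wins[max(live, key=lambda h: sum(cache[h, s] for s in boot))] += 1
        # prune hypotheses unlikely to be the best; never prune everything
        live = [h for h in live if wins[h] / n_boot >= 1 - alpha] \
               or [max(scores, key=scores.get)]
        if len(live) == 1:
            break
    return max(live, key=lambda h: scores[h])

# toy usage: strings as hypotheses and samples, word overlap as utility
pool = ["the cat sat", "a cat sat", "the dog sat", "cats sit"]
overlap = lambda h, s: len(set(h.split()) & set(s.split()))
print(mbr_prune(pool, random.choices(pool, k=32), overlap))
```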

HRI

Lively: Enabling Multimodal, Lifelike, and Extensible Real-time Robot Motion

  • Andrew Schoen, Dakota Sullivan, Ze-dong Zhang, D. Rakita, Bilge Mutlu

  • Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction

  • March 13, 2023

Robots designed to interact with people in collaborative or social scenarios must move in ways that are consistent with the robot's task and communication goals. However, combining these goals in a naïve manner can result in mutually exclusive solutions, or infeasible or problematic states and actions. In this paper, we present Lively, a framework which supports configurable, real-time, task-based and communicative or socially-expressive motion for collaborative and social robotics across multiple levels of programmatic accessibility. Lively supports a wide range of control methods (i.e. position, orientation, and joint-space goals), and balances them with complex procedural behaviors for natural, lifelike motion that are effective in collaborative and social contexts. We discuss the design of three levels of programmatic accessibility of Lively, including a graphical user interface for visual design called LivelyStudio, the core library Lively for full access to its capabilities for developers, and an extensible architecture for greater customizability and capability.

TLDR

This paper discusses the design of three levels of programmatic accessibility of Lively, including a graphical user interface for visual design called LivelyStudio, the core library Lively for full access to its capabilities for developers, and an extensible architecture for greater customizability and capability.

Interactive Policy Shaping for Human-Robot Collaboration with Transparent Matrix Overlays

  • Jake Brawer, Debasmita Ghose, Kate Candon, Meiying Qin, A. Roncone, Marynel Vázquez, B. Scassellati

  • Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction

  • March 13, 2023

One important aspect of effective human-robot collaborations is the ability for robots to adapt quickly to the needs of humans. While techniques like deep reinforcement learning have demonstrated success as sophisticated tools for learning robot policies, the fluency of human-robot collaborations is often limited by these policies' inability to integrate changes to a user's preferences for the task. To address these shortcomings, we propose a novel approach that can modify learned policies at execution time via symbolic if-this-then-that rules corresponding to a modular and superimposable set of low-level constraints on the robot's policy. These rules, which we call Transparent Matrix Overlays, function not only as succinct and explainable descriptions of the robot's current strategy but also as an interface by which a human collaborator can easily alter a robot's policy via verbal commands. We demonstrate the efficacy of this approach on a series of proof-of-concept cooking tasks performed in simulation and on a physical robot.

TLDR

A novel approach that can modify learned policies at execution time via symbolic if-this-then-that rules corresponding to a modular and superimposable set of low-level constraints on the robot's policy is proposed.
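
As a rough illustration of the if-this-then-that idea (a hypothetical encoding, not the authors' system), each rule can be thought of as a small additive mask that is superimposed on the learned policy's action scores at execution time:

```python
# Illustrative sketch: rules compile to additive score masks that are
# superimposed on a learned policy's action scores (hypothetical encoding).
import numpy as np

ACTIONS = ["pick_knife", "pick_spoon", "stir", "wait"]

def rule(condition, action, delta):
    """If condition(state) holds, shift the named action's score by delta."""
    def overlay(state):
        m = np.zeros(len(ACTIONS))
        if condition(state):
            m[ACTIONS.index(action)] = delta
        return m
    return overlay

def shaped_action(policy_scores, state, overlays):
    scores = policy_scores.copy()
    for ov in overlays:               # overlays are modular and superimposable
        scores += ov(state)
    return ACTIONS[int(np.argmax(scores))]

# "if the user is holding the knife, don't pick the knife"
overlays = [rule(lambda s: s["user_has_knife"], "pick_knife", -1e9)]
base = np.array([2.0, 1.0, 0.5, 0.1])   # learned policy's action scores
print(shaped_action(base, {"user_has_knife": True}, overlays))  # pick_spoon
```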

Exploring Machine-like Behaviors for Socially Acceptable Robot Navigation in Elevators

  • Danilo Gallo, Shreepriya Gonzalez Jimenez, Antonietta Grasso, Cécile Boulard, T. Colombino

  • 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI)

  • March 7, 2022

In this paper, we present our ongoing research on socially acceptable robot navigation for an indoor elevator sharing scenario. Informed by naturalistic observations of human elevator use, we discuss the social nuances involved in a seemingly simple activity like taking an elevator and the challenges and limitations of modeling robot behaviors based on a full human-like approach. We propose the principle of machine-like for the design of robot behavior policies that effectively accomplish tasks without being disruptive to the routines of people sharing the elevator with the robots. We explored this approach in a bodystorming session and conducted a preliminary evaluation of the resulting considerations through an online user study. Participants differentiated robots from humans for issues of proxemics and priority, and machine-like behaviors were preferred over human-like behaviors. We present our findings and discuss the advantages and limitations identified for both approaches for designing socially acceptable navigation behaviors.

TLDR

The principle of machine-like is proposed for the design of robot behavior policies that effectively accomplish tasks without being disruptive to the routines of people sharing the elevator with the robots.

MIND MELD: Personalized Meta-Learning for Robot-Centric Imitation Learning

  • Mariah L. Schrum, Erin Hedlund-Botti, Nina Moorman, M. Gombolay

  • 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI)

  • March 7, 2022

Learning from demonstration (LfD) techniques seek to enable users without computer programming experience to teach robots novel tasks. There are generally two types of LfD: human- and robot-centric. While human-centric learning is intuitive, it suffers from performance degradation due to covariate shift. Robot-centric approaches, such as Dataset Aggregation (DAgger), address covariate shift but can struggle to learn from suboptimal human teachers. To create a more human-aware version of robot-centric LfD, we present Mutual Information-driven Meta-learning from Demonstration (MIND MELD). MIND MELD meta-learns a mapping from suboptimal and heterogeneous human feedback to optimal labels, thereby improving the learning signal for robot-centric LfD. The key to our approach is learning an informative personalized embedding using mutual information maximization via variational inference. The embedding then informs a mapping from human-provided labels to optimal labels. We evaluate our framework in a human-subjects experiment, demonstrating that our approach improves corrective labels provided by human demonstrators. Our framework outperforms baselines in terms of ability to reach the goal ($p < .001$), average distance from the goal ($p = .006$), and various subjective ratings ($p = .008$).

TLDR

MIND MELD meta-learns a mapping from suboptimal and heterogeneous human feedback to optimal labels, thereby improving the learning signal for robot-centric LfD, and is evaluated in a human-subjects experiment, demonstrating that the approach improves corrective labels provided by human demonstrators.

REGROUP: A Robot-Centric Group Detection and Tracking System

  • Angelique Taylor, L. Riek

  • 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI)

  • March 7, 2022

To facilitate HRI's transition from dyadic to group interaction, new methods are needed for robots to sense and understand team behavior. We introduce the Robot-Centric Group Detection and Tracking System (REGROUP), a new method that enables robots to detect and track groups of people from an ego-centric perspective using a crowd-aware, tracking-by-detection approach. Our system employs a novel technique that leverages person re-identification deep learning features to address the group data association problem. REGROUP is robust to real-world vision challenges such as occlusion, camera egomotion, shadow, and varying lighting illuminations. Also, it runs in real-time on real-world data. We show that REGROUP outperformed three group detection methods by up to 40% in terms of precision and up to 18% in terms of recall. Also, we show that REGROUP's group tracking method outperformed three state-of-the-art methods by up to 66% in terms of tracking accuracy and 20% in terms of tracking precision. We plan to publicly release our system to support HRI teaming research and development. We hope this work will enable the development of robots that can more effectively locate and perceive their teammates, particularly in uncertain, unstructured environments.

TLDR

The Robot-Centric Group Detection and Tracking System (REGROUP), a new method that enables robots to detect and track groups of people from an ego-centric perspective using a crowd-aware, tracking-by-detection approach, employs a novel technique that leverages person re-identification deep learning features to address the group data association problem.

ICLR

Universal Few-shot Learning of Dense Prediction Tasks with Visual Token Matching

  • Donggyun Kim, Jinwoo Kim, Seongwoong Cho, Chong Luo, Seunghoon Hong

  • ArXiv

  • March 27, 2023

Dense prediction tasks are a fundamental class of problems in computer vision. As supervised methods suffer from high pixel-wise labeling cost, a few-shot learning solution that can learn any dense task from a few labeled images is desired. Yet, current few-shot learning methods target a restricted set of tasks such as semantic segmentation, presumably due to challenges in designing a general and unified model that is able to flexibly and efficiently adapt to arbitrary tasks of unseen semantics. We propose Visual Token Matching (VTM), a universal few-shot learner for arbitrary dense prediction tasks. It employs non-parametric matching on patch-level embedded tokens of images and labels that encapsulates all tasks. Also, VTM flexibly adapts to any task with a tiny amount of task-specific parameters that modulate the matching algorithm. We implement VTM as a powerful hierarchical encoder-decoder architecture involving ViT backbones where token matching is performed at multiple feature hierarchies. We evaluate VTM on a challenging variant of the Taskonomy dataset and observe that it robustly few-shot learns various unseen dense prediction tasks. Surprisingly, it is competitive with fully supervised baselines using only 10 labeled examples of novel tasks (0.004% of full supervision) and sometimes outperforms them when using 0.1% of full supervision. Codes are available at https://github.com/GitGyun/visual_token_matching.

TLDR

This work proposes Visual Token Matching (VTM), a universal few-shot learner for arbitrary dense prediction tasks that employs non-parametric matching on patch-level embedded tokens of images and labels that encapsulates all tasks and flexibly adapts to any task with a tiny amount of task-specific parameters that modulate the matching algorithm.
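
A minimal sketch of the non-parametric matching step described in the abstract: query patch tokens attend over support patch tokens and copy their label tokens. The shapes, the dot-product similarity, and the temperature are assumptions for illustration.

```python
# A hedged sketch of token matching: each query image token attends over
# support image tokens and returns a weighted sum of support label tokens.
import torch

def token_matching(q_img, s_img, s_lbl, tau=0.1):
    """q_img: (Nq, d) query image tokens; s_img: (Ns, d) support image
    tokens; s_lbl: (Ns, d) support label tokens (assumed shapes)."""
    sim = q_img @ s_img.T / tau          # (Nq, Ns) similarities
    w = torch.softmax(sim, dim=-1)       # matching weights
    return w @ s_lbl                     # predicted query label tokens

q = torch.randn(196, 64)                 # e.g., 14x14 patches, 64-dim tokens
s_img, s_lbl = torch.randn(980, 64), torch.randn(980, 64)  # 5-shot support
print(token_matching(q, s_img, s_lbl).shape)  # torch.Size([196, 64])
```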

Emergence of Maps in the Memories of Blind Navigation Agents

  • Erik Wijmans, M. Savva, Irfan Essa, Stefan Lee, Ari S. Morcos, Dhruv Batra

  • ArXiv

  • January 30, 2023

Animal navigation research posits that organisms build and maintain internal spatial representations, or maps, of their environment. We ask if machines -- specifically, artificial intelligence (AI) navigation agents -- also build implicit (or 'mental') maps. A positive answer to this question would (a) explain the surprising phenomenon in recent literature of ostensibly map-free neural-networks achieving strong performance, and (b) strengthen the evidence of mapping as a fundamental mechanism for navigation by intelligent embodied agents, whether they be biological or artificial. Unlike animal navigation, we can judiciously design the agent's perceptual system and control the learning paradigm to nullify alternative navigation mechanisms. Specifically, we train 'blind' agents -- with sensing limited to only egomotion and no other sensing of any kind -- to perform PointGoal navigation ('go to $\Delta$ x, $\Delta$ y') via reinforcement learning. Our agents are composed of navigation-agnostic components (fully-connected and recurrent neural networks), and our experimental setup provides no inductive bias towards mapping. Despite these harsh conditions, we find that blind agents are (1) surprisingly effective navigators in new environments (~95% success); (2) they utilize memory over long horizons (remembering ~1,000 steps of past experience in an episode); (3) this memory enables them to exhibit intelligent behavior (following walls, detecting collisions, taking shortcuts); (4) there is emergence of maps and collision detection neurons in the representations of the environment built by a blind agent as it navigates; and (5) the emergent maps are selective and task dependent (e.g. the agent 'forgets' exploratory detours). Overall, this paper presents no new techniques for the AI audience, but a surprising finding, an insight, and an explanation.

TLDR

This paper trains 'blind' agents -- with sensing limited to only egomotion and no other sensing of any kind -- to perform PointGoal navigation via reinforcement learning, and finds that blind agents are surprisingly effective navigators in new environments.

DreamFusion: Text-to-3D using 2D Diffusion

  • Ben Poole, Ajay Jain, J. Barron, B. Mildenhall

  • ArXiv

  • September 29, 2022

Recent breakthroughs in text-to-image synthesis have been driven by diffusion models trained on billions of image-text pairs. Adapting this approach to 3D synthesis would require large-scale datasets of labeled 3D data and efficient architectures for denoising 3D data, neither of which currently exist. In this work, we circumvent these limitations by using a pretrained 2D text-to-image diffusion model to perform text-to-3D synthesis. We introduce a loss based on probability density distillation that enables the use of a 2D diffusion model as a prior for optimization of a parametric image generator. Using this loss in a DeepDream-like procedure, we optimize a randomly-initialized 3D model (a Neural Radiance Field, or NeRF) via gradient descent such that its 2D renderings from random angles achieve a low loss. The resulting 3D model of the given text can be viewed from any angle, relit by arbitrary illumination, or composited into any 3D environment. Our approach requires no 3D training data and no modifications to the image diffusion model, demonstrating the effectiveness of pretrained image diffusion models as priors. See dreamfusion3d.github.io for a more immersive view into our 3D results.

TLDR

This work introduces a loss based on probability density distillation that enables the use of a 2D diffusion model as a prior for optimization of a parametric image generator and optimizes a randomly-initialized 3D model via gradient descent such that its 2D renderings from random angles achieve a low loss.
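
This loss is now commonly known as score distillation sampling (SDS). Below is a hedged sketch of its typical gradient form, assuming a differentiable render `x`, a stand-in pretrained noise predictor `diffusion_eps`, and the common `1 - alpha_bar` weighting; the paper's exact weighting may differ.

```python
# A hedged sketch of score distillation sampling (assumed names/weighting).
import torch

def sds_grad(x, diffusion_eps, text_emb, alphas_bar):
    """x: a differentiable rendering (e.g., from a NeRF), shape (1, 3, H, W)."""
    t = torch.randint(20, 980, (1,))                # random diffusion timestep
    a = alphas_bar[t].view(1, 1, 1, 1)
    eps = torch.randn_like(x)
    x_t = a.sqrt() * x + (1 - a).sqrt() * eps       # forward-noise the render
    with torch.no_grad():
        eps_hat = diffusion_eps(x_t, t, text_emb)   # frozen 2D prior's prediction
    w = 1 - a                                       # one common weighting choice
    # inject w * (eps_hat - eps) as the gradient of x; it flows back to the
    # scene parameters through the renderer, never through the diffusion model
    x.backward(gradient=w * (eps_hat - eps))

# toy stand-ins so the sketch runs end to end
theta = torch.randn(1, 3, 64, 64, requires_grad=True)  # "scene parameters"
prior = lambda x_t, t, emb: torch.randn_like(x_t)      # dummy noise predictor
alphas_bar = torch.linspace(0.9999, 0.01, 1000)
sds_grad(theta * 1.0, prior, None, alphas_bar)
print(theta.grad.shape)  # torch.Size([1, 3, 64, 64])
```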

Expressiveness and Approximation Properties of Graph Neural Networks

  • F. Geerts, Juan L. Reutter

  • ArXiv

  • April 10, 2022

Characterizing the separation power of graph neural networks (GNNs) provides an understanding of their limitations for graph learning tasks. Results regarding separation power are, however, usually geared at specific GNN architectures, and tools for understanding arbitrary GNN architectures are generally lacking. We provide an elegant way to easily obtain bounds on the separation power of GNNs in terms of the Weisfeiler-Leman (WL) tests, which have become the yardstick to measure the separation power of GNNs. The crux is to view GNNs as expressions in a procedural tensor language describing the computations in the layers of the GNNs. Then, by a simple analysis of the obtained expressions, in terms of the number of indexes and the nesting depth of summations, bounds on the separation power in terms of the WL-tests readily follow. We use tensor language to define Higher-Order Message-Passing Neural Networks (or k-MPNNs), a natural extension of MPNNs. Furthermore, the tensor language point of view allows for the derivation of universality results for classes of GNNs in a natural way. Our approach provides a toolbox with which GNN architecture designers can analyze the separation power of their GNNs, without needing to know the intricacies of the WL-tests. We also provide insights in what is needed to boost the separation power of GNNs.

TLDR

The approach provides a toolbox with which GNN architecture designers can analyze the separation power of their GNNs, without needing to know the intricacies of the WL-tests, and uses tensor language to define Higher-Order Message-Passing Neural Networks (or k-MPNNs), a natural extension of MPNNs.

Learning strides in convolutional neural networks

  • Rachid Riad, O. Teboul, David Grangier, Neil Zeghidour

  • ArXiv

  • February 3, 2022

Convolutional neural networks typically contain several downsampling operators, such as strided convolutions or pooling layers, that progressively reduce the resolution of intermediate representations. This provides some shift-invariance while reducing the computational complexity of the whole architecture. A critical hyperparameter of such layers is their stride: the integer factor of downsampling. As strides are not differentiable, finding the best configuration either requires cross-validation or discrete optimization (e.g. architecture search), which rapidly become prohibitive as the search space grows exponentially with the number of downsampling layers. Hence, exploring this search space by gradient descent would allow finding better configurations at a lower computational cost. This work introduces DiffStride, the first downsampling layer with learnable strides. Our layer learns the size of a cropping mask in the Fourier domain, that effectively performs resizing in a differentiable way. Experiments on audio and image classification show the generality and effectiveness of our solution: we use DiffStride as a drop-in replacement to standard downsampling layers and outperform them. In particular, we show that introducing our layer into a ResNet-18 architecture allows keeping consistent high performance on CIFAR10, CIFAR100 and ImageNet even when training starts from poor random stride configurations. Moreover, formulating strides as learnable variables allows us to introduce a regularization term that controls the computational complexity of the architecture. We show how this regularization allows trading off accuracy for efficiency on ImageNet.

TLDR

DiffStride, the first downsampling layer with learnable strides, learns the size of a cropping mask in the Fourier domain that effectively performs resizing in a differentiable way and allows trading off accuracy for efficiency on ImageNet.
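
A simplified sketch of the mechanism (assumptions throughout, not the authors' exact layer): a learnable fractional stride shapes a smooth low-pass mask in the Fourier domain, so gradients reach the stride even though the cropped output size itself stays an integer.

```python
# Hedged DiffStride-like sketch: a smooth Fourier-domain crop mask whose
# width depends on a learnable stride, followed by a hard crop and iFFT.
import torch

def diffstride_like(x, stride, sharpness=4.0):
    """x: (B, C, H, W); stride: scalar tensor with requires_grad=True."""
    B, C, H, W = x.shape
    Xf = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    # smooth low-pass box whose half-width H / (2 * stride) depends on stride
    fy = torch.arange(H).view(-1, 1) - H / 2
    fx = torch.arange(W).view(1, -1) - W / 2
    mask = torch.sigmoid(sharpness * (H / (2 * stride) - fy.abs())) * \
           torch.sigmoid(sharpness * (W / (2 * stride) - fx.abs()))
    Xf = Xf * mask
    # hard crop to the output resolution (the size itself stays an integer)
    h, w = int(H / stride.item()), int(W / stride.item())
    top, left = (H - h) // 2, (W - w) // 2
    Xf = Xf[..., top:top + h, left:left + w]
    y = torch.fft.ifft2(torch.fft.ifftshift(Xf, dim=(-2, -1))).real
    return y / stride ** 2        # renormalize energy after cropping

x = torch.randn(2, 3, 32, 32)
s = torch.tensor(2.0, requires_grad=True)
y = diffstride_like(x, s)
y.sum().backward()                # gradients reach the stride via the mask
print(y.shape, s.grad is not None)  # torch.Size([2, 3, 16, 16]) True
```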

Analytic-DPM: an Analytic Estimate of the Optimal Reverse Variance in Diffusion Probabilistic Models

  • Fan Bao, Chongxuan Li, Jun Zhu, Bo Zhang

  • ArXiv

  • January 17, 2022

Diffusion probabilistic models (DPMs) represent a class of powerful generative models. Despite their success, the inference of DPMs is expensive since it generally needs to iterate over thousands of timesteps. A key problem in the inference is to estimate the variance in each timestep of the reverse process. In this work, we present a surprising result that both the optimal reverse variance and the corresponding optimal KL divergence of a DPM have analytic forms w.r.t. its score function. Building upon it, we propose Analytic-DPM, a training-free inference framework that estimates the analytic forms of the variance and KL divergence using the Monte Carlo method and a pretrained score-based model. Further, to correct the potential bias caused by the score-based model, we derive both lower and upper bounds of the optimal variance and clip the estimate for a better result. Empirically, Analytic-DPM improves the log-likelihood of various DPMs, produces high-quality samples, and meanwhile enjoys a 20x to 80x speed up.

TLDR

Analytic-DPM is proposed, a training-free inference framework that estimates the analytic forms of the variance and KL divergence using the Monte Carlo method and a pretrained score-based model, and improves the log-likelihood of various DPMs, produces high-quality samples, and meanwhile enjoys a 20x to 80x speed up.

Bootstrapped Meta-Learning

  • Sebastian Flennerhag, Yannick Schroecker, Tom Zahavy, Hado Philip van Hasselt, David Silver, Satinder Singh

  • ArXiv

  • September 9, 2021

Meta-learning empowers artificial intelligence to increase its efficiency by learning how to learn. Unlocking this potential involves overcoming a challenging meta-optimisation problem. We propose an algorithm that tackles this problem by letting the meta-learner teach itself. The algorithm first bootstraps a target from the meta-learner, then optimises the meta-learner by minimising the distance to that target under a chosen (pseudo-)metric. Focusing on meta-learning with gradients, we establish conditions that guarantee performance improvements and show that the metric can control meta-optimisation. Meanwhile, the bootstrapping mechanism can extend the effective meta-learning horizon without requiring backpropagation through all updates. We achieve a new state-of-the art for model-free agents on the Atari ALE benchmark and demonstrate that it yields both performance and efficiency gains in multi-task meta-learning. Finally, we explore how bootstrapping opens up new possibilities and find that it can meta-learn efficient exploration in an epsilon-greedy Q-learning agent, without backpropagating through the update rule.

TLDR

An algorithm is proposed that tackles a challenging meta-optimisation problem by letting the meta-learner teach itself, and it is found that it can meta-learn efficient exploration in an epsilon-greedy Q-learning agent, without backpropagating through the update rule.

ICML

The Importance of Non-Markovianity in Maximum State Entropy Exploration

  • Mirco Mutti, Ric De Santi, Marcello Restelli

  • International Conference on Machine Learning

  • February 7, 2022

In the maximum state entropy exploration framework, an agent interacts with a reward-free environment to learn a policy that maximizes the entropy of the expected state visitations it is inducing. Hazan et al. (2019) noted that the class of Markovian stochastic policies is sufficient for the maximum state entropy objective, and exploiting non-Markovianity is generally considered pointless in this setting. In this paper, we argue that non-Markovianity is instead paramount for maximum state entropy exploration in a finite-sample regime. Especially, we recast the objective to target the expected entropy of the induced state visitations in a single trial. Then, we show that the class of non-Markovian deterministic policies is sufficient for the introduced objective, while Markovian policies suffer non-zero regret in general. However, we prove that the problem of finding an optimal non-Markovian policy is NP-hard. Despite this negative result, we discuss avenues to address the problem in a tractable way and how non-Markovian exploration could benefit the sample efficiency of online reinforcement learning in future works.

TLDR

This paper recasts the objective to target the expected entropy of the induced state visitations in a single trial, and shows that the class of non-Markovian deterministic policies is sufficient for the introduced objective, while Markovian policies suffer non-zero regret in general.
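
Concretely, the single-trial objective is just the entropy of the empirical state-visitation distribution within one episode. The toy trajectories below illustrate why avoiding revisits, which generally requires memory of the past (i.e., non-Markovianity), maximizes it.

```python
# Entropy of the empirical state-visitation distribution of one episode.
from collections import Counter
from math import log

def single_trial_entropy(trajectory):
    counts = Counter(trajectory)
    n = len(trajectory)
    return -sum((c / n) * log(c / n) for c in counts.values())

# a policy that remembers where it has been can avoid revisits...
print(single_trial_entropy(["s1", "s2", "s3", "s4"]))  # log(4) ~ 1.386
# ...while a memoryless one may cycle and lose per-trial entropy
print(single_trial_entropy(["s1", "s2", "s1", "s2"]))  # log(2) ~ 0.693
```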

Understanding Dataset Difficulty with V-Usable Information

  • Kawin Ethayarajh, Yejin Choi, Swabha Swayamdipta

  • International Conference on Machine Learning

  • December 31, 2021

Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. However, this comparison provides little understanding of how difficult each instance in a given distribution is, or what attributes make the dataset difficult for a given model. To address these questions, we frame dataset difficulty, w.r.t. a model V, as the lack of V-usable information (Xu et al., 2019), where a lower value indicates a more difficult dataset for V. We further introduce pointwise V-information (PVI) for measuring the difficulty of individual instances w.r.t. a given distribution. While standard evaluation metrics typically only compare different models for the same dataset, V-usable information and PVI also permit the converse: for a given model V, we can compare different datasets, as well as different instances/slices of the same dataset. Furthermore, our framework allows for the interpretability of different input attributes via transformations of the input, which we use to discover annotation artefacts in widely-used NLP benchmarks.

TLDR

This work frames dataset difficulty, w.r.t. a model V, as the lack of V-usable information (Xu et al., 2019), where a lower value indicates a more difficult dataset for V.
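
Given the two finetuned predictors the framework assumes (one conditioned on a null input, one on the real input), pointwise V-information reduces to a one-line computation following the paper's definition; the probabilities below are invented for illustration.

```python
# PVI(x -> y) = -log2 g0(y) + log2 g1(y | x), where g0 is the model family V
# fit with a null input and g1 with the real input (both assumed given).
import math

def pvi(p_null_y, p_cond_y):
    """p_null_y: probability V assigns to the gold label y given a null input;
    p_cond_y: probability V assigns to y given the actual input x."""
    return -math.log2(p_null_y) + math.log2(p_cond_y)

# an easy instance: the input makes the gold label much more predictable
print(pvi(p_null_y=0.5, p_cond_y=0.99))   # ~0.99 bits -> easy for V
# a hard instance: the input does not help the model at all
print(pvi(p_null_y=0.5, p_cond_y=0.25))   # -1.0 bits -> harder than guessing
```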

Stable Conformal Prediction Sets

  • Eugène Ndiaye

  • International Conference on Machine Learning

  • December 19, 2021

When one observes a sequence of variables $(x_1, y_1), \ldots, (x_n, y_n)$, Conformal Prediction (CP) is a methodology that allows to estimate a confidence set for $y_{n+1}$ given $x_{n+1}$ by merely assuming that the distribution of the data is exchangeable. CP sets have guaranteed coverage for any finite population size $n$. While appealing, the computation of such a set turns out to be infeasible in general, e.g. when the unknown variable $y_{n+1}$ is continuous. The bottleneck is that it is based on a procedure that readjusts a prediction model on data where we replace the unknown target by all its possible values in order to select the most probable one. This requires computing an infinite number of models, which often makes it intractable. In this paper, we combine CP techniques with classical algorithmic stability bounds to derive a prediction set computable with a single model fit. We demonstrate that our proposed confidence set does not lose any coverage guarantees while avoiding the need for data splitting as currently done in the literature. We provide some numerical experiments to illustrate the tightness of our estimation when the sample size is sufficiently large, on both synthetic and real datasets.

TLDR

This paper combines CP techniques with classical algorithmic stability bounds to derive a prediction set computable with a single model fit, and demonstrates that the proposed confidence set does not lose any coverage guarantees while avoiding the need for data splitting as currently done in the literature.

IJCAI

Levin Tree Search with Context Models

  • Laurent Orseau, Marcus Hutter, Levi H. S. Lelis

  • International Joint Conference on Artificial Intelligence

  • May 26, 2023

Levin Tree Search (LTS) is a search algorithm that makes use of a policy (a probability distribution over actions) and comes with a theoretical guarantee on the number of expansions before reaching a goal node, depending on the quality of the policy. This guarantee can be used as a loss function, which we call the LTS loss, to optimize neural networks representing the policy (LTS+NN). In this work we show that the neural network can be substituted with parameterized context models originating from the online compression literature (LTS+CM). We show that the LTS loss is convex under this new model, which allows for using standard convex optimization tools, and obtain convergence guarantees to the optimal parameters in an online setting for a given set of solution trajectories --- guarantees that cannot be provided for neural networks. The new LTS+CM algorithm compares favorably against LTS+NN on several benchmarks: Sokoban (Boxoban), The Witness, and the 24-Sliding Tile puzzle (STP). The difference is particularly large on STP, where LTS+NN fails to solve most of the test instances while LTS+CM solves each test instance in a fraction of a second. Furthermore, we show that LTS+CM is able to learn a policy that solves the Rubik's cube in only a few hundred expansions, which considerably improves upon previous machine learning techniques.

TLDR

This work shows that the neural network can be substituted with parameterized context models originating from the online compression literature (LTS+CM) and obtains convergence guarantees to the optimal parameters in an online setting for a given set of solution trajectories, guarantees that cannot be provided for neural networks.
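
For intuition, here is a compact sketch of the LTS expansion order itself, independent of whether the policy is a neural network or a context model: nodes come off a priority queue ordered by depth(n) / pi(n), where pi(n) multiplies the policy's action probabilities along the path, which is the ordering behind the guarantee mentioned above. The policy, successor function, and toy domain are assumptions.

```python
# A compact sketch of Levin Tree Search's best-first expansion order.
import heapq

def lts(start, successors, policy, is_goal, budget=10_000):
    # entries: (d / pi, tie-breaker, state, depth d, path probability pi)
    frontier = [(0.0, 0, start, 0, 1.0)]
    tick = 0
    while frontier and tick < budget:
        _, _, s, d, pi = heapq.heappop(frontier)
        if is_goal(s):
            return s, tick
        for a, p in policy(s).items():           # action -> probability
            child = successors(s, a)
            tick += 1
            heapq.heappush(frontier,
                           ((d + 1) / (pi * p), tick, child, d + 1, pi * p))
    return None, tick

# toy chain world: reach state 3 by repeatedly choosing "right"
policy = lambda s: {"right": 0.8, "left": 0.2}
succ = lambda s, a: s + 1 if a == "right" else max(0, s - 1)
print(lts(0, succ, policy, lambda s: s == 3))
```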

Plurality Veto: A Simple Voting Rule Achieving Optimal Metric Distortion

  • Fatih Erdem Kizilkaya, D. Kempe

  • International Joint Conference on Artificial Intelligence

  • June 14, 2022

The metric distortion framework posits that n voters and m candidates are jointly embedded in a metric space such that voters rank candidates that are closer to them higher. A voting rule's purpose is to pick a candidate with minimum total distance to the voters, given only the rankings, but not the actual distances. As a result, in the worst case, each deterministic rule picks a candidate whose total distance is at least three times larger than that of an optimal one, i.e., has distortion at least 3. A recent breakthrough result showed that achieving this bound of 3 is possible; however, the proof is non-constructive, and the voting rule itself is a complicated exhaustive search. Our main result is an extremely simple voting rule, called Plurality Veto, which achieves the same optimal distortion of 3. Each candidate starts with a score equal to his number of first-place votes. These scores are then gradually decreased via an n-round veto process in which a candidate drops out when his score reaches zero. One after the other, voters decrement the score of their bottom choice among the standing candidates, and the last standing candidate wins. We give a one-paragraph proof that this voting rule achieves distortion 3. This rule is also immensely practical, and it only makes two queries to each voter, so it has low communication overhead. We also show that a straightforward extension can be used to give a constructive proof of the more general Ranking-Matching Lemma of Gkatzelis et al. We also generalize Plurality Veto into a class of randomized voting rules in the following way: Plurality veto is run only for k < n rounds; then, a candidate is chosen with probability proportional to his residual score. This general rule interpolates between Random Dictatorship (for k=0) and Plurality Veto (for k=n-1), and k controls the variance of the output. We show that for all k, this rule has expected distortion at most 3.

TLDR

An extremely simple voting rule called Plurality Veto achieves the optimal distortion of 3, and it is shown that for all k, the randomized generalization of this rule has expected distortion at most 3.
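
Because the rule is fully specified in the abstract, it fits in a few lines. This sketch processes voters in the given ballot order; the distortion guarantee holds for any order.

```python
# Plurality Veto: plurality scores, then an n-round veto process in which
# each voter decrements their bottom standing candidate; last standing wins.
def plurality_veto(rankings):
    """rankings: list of ballots, each a full ranking of candidates, best first."""
    candidates = set(rankings[0])
    score = {c: 0 for c in candidates}
    for ballot in rankings:
        score[ballot[0]] += 1                 # plurality scores
    standing = {c for c in candidates if score[c] > 0}
    for ballot in rankings:                   # the veto phase
        if len(standing) == 1:
            break
        bottom = next(c for c in reversed(ballot) if c in standing)
        score[bottom] -= 1
        if score[bottom] == 0:                # drops out at zero
            standing.discard(bottom)
    return standing.pop()

ballots = [["a", "b", "c"], ["a", "c", "b"], ["b", "c", "a"],
           ["c", "b", "a"], ["c", "b", "a"]]
print(plurality_veto(ballots))  # "c" under this voter order
```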

KDD

All in One: Multi-Task Prompting for Graph Neural Networks

  • Xiangguo Sun, Hongtao Cheng, Jia Li, Bo Liu, J. Guan

  • Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

  • July 4, 2023

Recently, "pre-training and fine-tuning'' has been adopted as a standard workflow for many graph tasks since it can take general graph knowledge to relieve the lack of graph annotations from each application. However, graph tasks with node level, edge level, and graph level are far diversified, making the pre-training pretext often incompatible with these multiple tasks. This gap may even cause a "negative transfer'' to the specific application, leading to poor results. Inspired by the prompt learning in natural language processing (NLP), which has presented significant effectiveness in leveraging prior knowledge for various NLP tasks, we study the prompting topic for graphs with the motivation of filling the gap between pre-trained models and various graph tasks. In this paper, we propose a novel multi-task prompting method for graph models. Specifically, we first unify the format of graph prompts and language prompts with the prompt token, token structure, and inserting pattern. In this way, the prompting idea from NLP can be seamlessly introduced to the graph area. Then, to further narrow the gap between various graph tasks and state-of-the-art pre-training strategies, we further study the task space of various graph applications and reformulate downstream problems to the graph-level task. Afterward, we introduce meta-learning to efficiently learn a better initialization for the multi-task prompt of graphs so that our prompting framework can be more reliable and general for different tasks. We conduct extensive experiments, results from which demonstrate the superiority of our method.

TLDR

This paper proposes a novel multi-task prompting method for graph models that unifies the format of graph prompts and language prompts with the prompt token, token structure, and inserting pattern, and introduces meta-learning to efficiently learn a better initialization for the multi-task prompt of graphs so that the prompting framework can be more reliable and general for different tasks.

Improving Training Stability for Multitask Ranking Models in Recommender Systems

  • Jiaxi Tang, Yoel Drori, Daryl Chang, M. Sathiamoorthy, J. Gilmer, Li Wei, Xinyang Yi, Lichan Hong, Ed H. Chi

  • Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

  • February 17, 2023

Recommender systems play an important role in many content platforms. While most recommendation research is dedicated to designing better models to improve user experience, we found that research on stabilizing the training for such models is severely under-explored. As recommendation models become larger and more sophisticated, they are more susceptible to training instability issues, i.e., loss divergence, which can make the model unusable, waste significant resources and block model developments. In this paper, we share our findings and best practices we learned for improving the training stability of a real-world multitask ranking model for YouTube recommendations. We show some properties of the model that lead to unstable training and conjecture on the causes. Furthermore, based on our observations of training dynamics near the point of training instability, we hypothesize why existing solutions would fail, and propose a new algorithm to mitigate the limitations of existing solutions. Our experiments on YouTube production dataset show the proposed algorithm can significantly improve training stability while not compromising convergence, comparing with several commonly used baseline methods.

TLDR

The findings and best practices learned for improving the training stability of a real-world multitask ranking model for YouTube recommendations are shared and a new algorithm is proposed to mitigate the limitations of existing solutions.

Learning Causal Effects on Hypergraphs

  • Jing Ma, Mengting Wan, Longqi Yang, Jundong Li, Brent J. Hecht, J. Teevan

  • Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

  • July 7, 2022

Hypergraphs provide an effective abstraction for modeling multi-way group interactions among nodes, where each hyperedge can connect any number of nodes. Different from most existing studies which leverage statistical dependencies, we study hypergraphs from the perspective of causality. Specifically, in this paper, we focus on the problem of individual treatment effect (ITE) estimation on hypergraphs, aiming to estimate how much an intervention (e.g., wearing face covering) would causally affect an outcome (e.g., COVID-19 infection) of each individual node. Existing works on ITE estimation either assume that the outcome on one individual should not be influenced by the treatment assignments on other individuals (i.e., no interference), or assume the interference only exists between pairs of connected individuals in an ordinary graph. We argue that these assumptions can be unrealistic on real-world hypergraphs, where higher-order interference can affect the ultimate ITE estimations due to the presence of group interactions. In this work, we investigate high-order interference modeling, and propose a new causality learning framework powered by hypergraph neural networks. Extensive experiments on real-world hypergraphs verify the superiority of our framework over existing baselines.

TLDR

This work investigates high-order interference modeling and proposes a new causality learning framework powered by hypergraph neural networks, whose superiority over existing baselines is verified on real-world hypergraphs.

FederatedScope-GNN: Towards a Unified, Comprehensive and Efficient Package for Federated Graph Learning

  • Zhen Wang, Weirui Kuang, Yuexiang Xie, Liuyi Yao, Yaliang Li, Bolin Ding, Jingren Zhou

  • Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

  • April 12, 2022

The incredible development of federated learning (FL) has benefited various tasks in the domains of computer vision and natural language processing, and existing frameworks such as TFF and FATE have made deployment easy in real-world applications. However, federated graph learning (FGL), even though graph data are prevalent, has not been well supported due to its unique characteristics and requirements. The lack of FGL-related frameworks increases the effort required to accomplish reproducible research and to deploy in real-world applications. Motivated by such strong demand, in this paper, we first discuss the challenges in creating an easy-to-use FGL package and accordingly present our implemented package FederatedScope-GNN (FS-G), which provides (1) a unified view for modularizing and expressing FGL algorithms; (2) comprehensive DataZoo and ModelZoo for out-of-the-box FGL capability; (3) an efficient model auto-tuning component; and (4) off-the-shelf privacy attack and defense abilities. We validate the effectiveness of FS-G by conducting extensive experiments, which simultaneously yields many valuable insights about FGL for the community. Moreover, we employ FS-G to serve the FGL application in real-world E-commerce scenarios, where the attained improvements indicate great potential business benefits. We publicly release FS-G, as submodules of FederatedScope, at https://github.com/alibaba/FederatedScope to promote FGL's research and enable broad applications that would otherwise be infeasible due to the lack of a dedicated package.

TLDR

This paper presents the implemented package FederatedScope-GNN (FS-G), which provides a unified view for modularizing and expressing FGL algorithms, and employs FS-G to serve the FGL application in real-world E-commerce scenarios, where the attained improvements indicate great potential business benefits.

NEURIPS

Are Emergent Abilities of Large Language Models a Mirage?

  • Rylan Schaeffer, B. Miranda, Oluwasanmi Koyejo

  • ArXiv

  • April 28, 2023

Recent work claims that large language models display emergent abilities, abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent abilities intriguing is two-fold: their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability, appearing at seemingly unforeseeable model scales. Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous predictable changes in model performance. We present our alternative explanation in a simple mathematical model, then test it in three complementary ways: we (1) make, test and confirm three predictions on the effect of metric choice using the InstructGPT/GPT-3 family on tasks with claimed emergent abilities; (2) make, test and confirm two predictions about metric choices in a meta-analysis of emergent abilities on BIG-Bench; and (3) show how to choose metrics to produce never-before-seen seemingly emergent abilities in multiple vision tasks across diverse deep networks. Via all three analyses, we provide evidence that alleged emergent abilities evaporate with different metrics or with better statistics, and may not be a fundamental property of scaling AI models.

TLDR

Evidence is provided that alleged emergent abilities evaporate with different metrics or with better statistics, and may not be a fundamental property of scaling AI models.
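
A toy version of the argument can be reproduced in a few lines: hold a smoothly improving per-token accuracy fixed and compare a linear metric with a nonlinear one (exact match over a long answer). The scaling curve and answer length below are invented for illustration.

```python
# Toy illustration: a smooth per-token accuracy p makes exact match over
# L tokens (p ** L, a nonlinear metric) look sharply "emergent" with scale.
import numpy as np

scales = np.logspace(7, 11, 5)                 # hypothetical parameter counts
p = 1 - 0.1 * (scales / 1e7) ** -0.5           # smooth per-token accuracy
L = 100                                        # answer length in tokens
for n, pt in zip(scales, p):
    print(f"{n:14.0f}  per-token={pt:.4f}  exact-match={pt ** L:.5f}")
# per-token accuracy never jumps (0.90 -> 0.999), while exact match
# rises from ~0.00003 to ~0.9 -- an apparent emergence from the metric
```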

Is Out-of-Distribution Detection Learnable?

  • Zhen Fang, Yixuan Li, Jie Lu, Jiahua Dong, Bo Han, Feng Liu

  • ArXiv

  • October 26, 2022

Supervised learning aims to train a classifier under the assumption that training and test data are from the same distribution. To ease the above assumption, researchers have studied a more realistic setting: out-of-distribution (OOD) detection, where test data may come from classes that are unknown during training (i.e., OOD data). Due to the unavailability and diversity of OOD data, good generalization ability is crucial for effective OOD detection algorithms. To study the generalization of OOD detection, in this paper, we investigate the probably approximately correct (PAC) learning theory of OOD detection, which is proposed by researchers as an open problem. First, we find a necessary condition for the learnability of OOD detection. Then, using this condition, we prove several impossibility theorems for the learnability of OOD detection under some scenarios. Although the impossibility theorems are frustrating, we find that some conditions of these impossibility theorems may not hold in some practical scenarios. Based on this observation, we next give several necessary and sufficient conditions to characterize the learnability of OOD detection in some practical scenarios. Lastly, we also offer theoretical supports for several representative OOD detection works based on our OOD theory.

TLDR

This paper investigates the probably approximately correct (PAC) learning theory of OOD detection, which is proposed by researchers as an open problem, and proves several impossibility theorems for the learnability of OOD detection under some scenarios.

On-Demand Sampling: Learning Optimally from Multiple Distributions

  • Nika Haghtalab, M.I. Jordan, Eric Zhao

  • ArXiv

  • October 22, 2022

Social and real-world considerations such as robustness, fairness, social welfare and multi-agent tradeoffs have given rise to multi-distribution learning paradigms, such as collaborative, group distributionally robust, and fair federated learning. In each of these settings, a learner seeks to minimize its worst-case loss over a set of $n$ predefined distributions, while using as few samples as possible. In this paper, we establish the optimal sample complexity of these learning paradigms and give algorithms that meet this sample complexity. Importantly, our sample complexity bounds exceed that of the sample complexity of learning a single distribution only by an additive factor of $n \log(n) / \epsilon^2$. These improve upon the best known sample complexity of agnostic federated learning by Mohri et al. by a multiplicative factor of $n$, the sample complexity of collaborative learning by Nguyen and Zakynthinou by a multiplicative factor $\log n / \epsilon^3$, and give the first sample complexity bounds for the group DRO objective of Sagawa et al. To achieve optimal sample complexity, our algorithms learn to sample and learn from distributions on demand. Our algorithm design and analysis is enabled by our extensions of stochastic optimization techniques for solving stochastic zero-sum games. In particular, we contribute variants of Stochastic Mirror Descent that can trade off between players' access to cheap one-off samples or more expensive reusable ones.

TLDR

The optimal sample complexity of multi-distribution learning paradigms, such as collaborative, group distributionally robust, and fair federated learning, is established, and algorithms that meet this sample complexity are given.
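
The game-theoretic template behind such algorithms can be sketched as follows: a learner takes gradient steps on samples drawn from an adversarially weighted mixture of distributions, while the adversary reweights distributions by multiplicative weights on observed losses. The toy tasks and step sizes are assumptions; the paper's actual algorithms refine this template with careful on-demand sampling.

```python
# Hedged sketch of the zero-sum-game template for worst-case-loss learning.
import numpy as np

rng = np.random.default_rng(0)
n_dists, d, T, lr, eta = 3, 5, 2000, 0.05, 0.01

def sample(j):                    # toy task j: linear target with offset j
    x = rng.normal(0, 1, d)
    return x, x @ np.ones(d) + j  # distributions differ in their bias term

w, b = np.zeros(d), 0.0           # learner's linear model
q = np.ones(n_dists) / n_dists    # adversary's weights over distributions

for _ in range(T):
    j = rng.choice(n_dists, p=q)  # draw one distribution "on demand"
    x, y = sample(j)
    err = (w @ x + b) - y
    w -= lr * err * x             # SGD step on squared loss for the learner
    b -= lr * err
    losses = np.array([((w @ xx + b) - yy) ** 2
                       for xx, yy in (sample(k) for k in range(n_dists))])
    q *= np.exp(eta * losses)     # multiplicative weights: shift mass toward
    q /= q.sum()                  # the distributions with the worst loss

print("adversary's final focus:", np.round(q, 2))
```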

Beyond neural scaling laws: beating power law scaling via data pruning

  • Ben Sorscher, Robert Geirhos, Shashank Shekhar, S. Ganguli, Ari S. Morcos

  • ArXiv

  • June 29, 2022

Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, these improvements through scaling alone require considerable costs in compute and energy. Here we focus on the scaling of error with dataset size and show how in theory we can break beyond power law scaling and potentially even reduce it to exponential scaling instead if we have access to a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size. We then test this improved scaling prediction with pruned dataset size empirically, and indeed observe better than power law scaling in practice on ResNets trained on CIFAR-10, SVHN, and ImageNet. Next, given the importance of finding high-quality pruning metrics, we perform the first large-scale benchmarking study of ten different data pruning metrics on ImageNet. We find most existing high performing metrics scale poorly to ImageNet, while the best are computationally intensive and require labels for every image. We therefore developed a new simple, cheap and scalable self-supervised pruning metric that demonstrates comparable performance to the best supervised metrics. Overall, our work suggests that the discovery of good data-pruning metrics may provide a viable path forward to substantially improved neural scaling laws, thereby reducing the resource costs of modern deep learning.

TLDR

This work suggests that the discovery of good data-pruning metrics may provide a viable path forward to substantially improved neural scaling laws, thereby reducing the resource costs of modern deep learning.
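
Below is a hedged sketch of a self-supervised prototype-style pruning metric in the spirit of the one described: cluster embeddings with k-means and rank each example by its distance to the nearest centroid. Whether to keep easy or hard examples depends on dataset scale in the paper's analysis; the embeddings here are random stand-ins.

```python
# Hedged sketch of a self-supervised data pruning metric: k-means on
# embeddings, then rank examples by distance to their nearest centroid.
import numpy as np
from sklearn.cluster import KMeans

def prune_order(embeddings, k=10, seed=0):
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings)
    dist = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)
    return np.argsort(dist)       # ascending: most prototypical (easiest) first

emb = np.random.default_rng(0).normal(size=(1000, 32))  # stand-in embeddings
order = prune_order(emb, k=10)
keep_hardest_80pct = order[len(order) // 5:]   # e.g., drop the easiest 20%
print(len(keep_hardest_80pct))
```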

ProcTHOR: Large-Scale Embodied AI Using Procedural Generation

  • Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Jordi Salvador, Kiana Ehsani, Winson Han, Eric Kolve, Ali Farhadi, Aniruddha Kembhavi, Roozbeh Mottaghi

  • ArXiv

  • June 14, 2022

Massive datasets and high-capacity models have driven many recent advancements in computer vision and natural language understanding. This work presents a platform to enable similar success stories in Embodied AI. We propose PROCTHOR, a framework for procedural generation of Embodied AI environments. PROCTHOR enables us to sample arbitrarily large datasets of diverse, interactive, customizable, and performant virtual environments to train and evaluate embodied agents across navigation, interaction, and manipulation tasks. We demonstrate the power and potential of PROCTHOR via a sample of 10,000 generated houses and a simple neural model. Models trained using only RGB images on PROCTHOR, with no explicit mapping and no human task supervision, produce state-of-the-art results across 6 embodied AI benchmarks for navigation, rearrangement, and arm manipulation, including the presently running Habitat 2022, AI2-THOR Rearrangement 2022, and RoboTHOR challenges. We also demonstrate strong 0-shot results on these benchmarks, via pre-training on PROCTHOR with no fine-tuning on the downstream benchmark, often beating previous state-of-the-art systems that access the downstream training data.

TLDR

The proposed PROCTHOR, a framework for procedural generation of Embodied AI environments, enables us to sample arbitrarily large datasets of diverse, interactive, customizable, and performant virtual environments to train and evaluate embodied agents across navigation, interaction, and manipulation tasks.

High-dimensional limit theorems for SGD: Effective dynamics and critical scaling

  • G. B. Arous, R. Gheissari, Aukosh Jagannath

  • ArXiv

  • June 8, 2022

We study the scaling limits of stochastic gradient descent (SGD) with constant step-size in the high-dimensional regime. We prove limit theorems for the trajectories of summary statistics (i.e., finite-dimensional functions) of SGD as the dimension goes to infinity. Our approach allows one to choose the summary statistics that are tracked, the initialization, and the step-size. It yields both ballistic (ODE) and diffusive (SDE) limits, with the limit depending dramatically on the former choices. We show a critical scaling regime for the step-size, below which the effective ballistic dynamics matches gradient flow for the population loss, but at which a new correction term appears that changes the phase diagram. Near the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate. We demonstrate our approach on popular examples including estimation for spiked matrix and tensor models and classification via two-layer networks for binary and XOR-type Gaussian mixture models. These examples exhibit surprising phenomena including multimodal timescales to convergence as well as convergence to sub-optimal solutions with probability bounded away from zero from random (e.g., Gaussian) initializations. At the same time, we demonstrate the benefit of overparametrization by showing that the latter probability goes to zero as the second layer width grows.

TLDR

The approach allows one to choose the summary statistics that are tracked, the initialization, and the step-size, and yields both ballistic (ODE) and diffusive (SDE) limits, with the limit depending dramatically on the former choices.
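
As a purely schematic summary (our notation, not the paper's exact statements): with constant step-size $\delta$ in dimension $d$ and fresh samples $Y_{k+1}$, online SGD and the two kinds of limits for a summary statistic $u$ take roughly the following form.

```latex
% Online SGD with constant step-size:
X_{k+1} = X_k - \delta \,\nabla_x L(X_k;\, Y_{k+1})
% Ballistic (ODE) limit: for u_t = u(X_{\lfloor td \rfloor}) and \delta = c_\delta / d,
\frac{du_t}{dt} = f(u_t)
% f matches population gradient flow below the critical value of c_\delta and
% acquires an extra correction term exactly at the critical step-size scaling.
% Diffusive (SDE) limit near fixed points of f, on a longer time scale:
du_t = b(u_t)\,dt + \Sigma^{1/2}(u_t)\,dW_t
```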

A Neural Corpus Indexer for Document Retrieval

  • Yujing Wang, Ying Hou, Hong Wang, Ziming Miao, Shibin Wu, Hao Sun, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, Zheng Liu, Xing Xie, Hao Sun, Weiwei Deng, Qi Zhang, Mao Yang

  • ArXiv

  • June 6, 2022

Current state-of-the-art document retrieval solutions mainly follow an index-retrieve paradigm, where the index is hard to optimize directly for the final retrieval target. In this paper, we aim to show that an end-to-end deep neural network unifying training and indexing stages can significantly improve the recall performance of traditional methods. To this end, we propose Neural Corpus Indexer (NCI), a sequence-to-sequence network that generates relevant document identifiers directly for a designated query. To optimize the recall performance of NCI, we invent a prefix-aware weight-adaptive decoder architecture, and leverage tailored techniques including query generation, semantic document identifiers, and consistency-based regularization. Empirical studies demonstrate the superiority of NCI on two commonly used academic benchmarks, achieving +21.4% and +16.8% relative enhancement for Recall@1 on the NQ320k dataset and R-Precision on the TriviaQA dataset, respectively, compared to the best baseline method.

TLDR

This paper proposes Neural Corpus Indexer (NCI), a sequence-to-sequence network that generates relevant document identifiers directly for a designated query and leverages tailored techniques including query generation, semantic document identifiers, and consistency-based regularization.
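
The semantic document identifiers can be illustrated with a standard hierarchical k-means construction: documents are recursively clustered on their embeddings, and the path of cluster choices from the root becomes the identifier that the decoder generates token by token, so documents with similar content share identifier prefixes. A minimal sketch, assuming precomputed document embeddings; the names and hyperparameters here are ours, not the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_ids(embeddings, ids=None, k=10, leaf_size=100, prefix=()):
    """Assign each document an identifier = its path of cluster indices,
    e.g. (3, 0, 7), via recursive k-means over the embeddings."""
    if ids is None:
        ids = np.arange(len(embeddings))
    if len(ids) <= leaf_size:
        # At a leaf, disambiguate documents by their position in the cluster.
        return {doc: prefix + (pos,) for pos, doc in enumerate(ids)}
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings[ids])
    mapping = {}
    for c in range(k):
        members = ids[labels == c]
        if len(members) > 0:
            mapping.update(semantic_ids(embeddings, members, k, leaf_size, prefix + (c,)))
    return mapping
```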

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

  • Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. S. Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, Mohammad Norouzi

  • ArXiv

  • May 23, 2022

We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment. See https://imagen.research.google/ for an overview of the results.

TLDR

This work presents Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding, and finds that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment.

SIGIR

The Information Retrieval Experiment Platform

  • Maik Frobe, Jan Heinrich Reimer, Sean MacAvaney, Niklas Deckers, Simon Reich, Janek Bevendorff, Benno Stein, Matthias Hagen, Martin Potthast

  • Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

  • May 30, 2023

We integrate ir_datasets, ir_measures, and PyTerrier with TIRA in the Information Retrieval Experiment Platform (TIREx) to promote more standardized, reproducible, scalable, and even blinded retrieval experiments. Standardization is achieved when a retrieval approach implements PyTerrier's interfaces and the input and output of an experiment are compatible with ir_datasets and ir_measures. However, none of this is a must for reproducibility and scalability, as TIRA can run any dockerized software locally or remotely in a cloud-native execution environment. Version control and caching ensure efficient (re)execution. TIRA allows for blind evaluation when an experiment runs on a remote server or cloud not under the control of the experimenter. The test data and ground truth are then hidden from public access, and the retrieval software has to process them in a sandbox that prevents data leaks. We currently host an instance of TIREx with 15 corpora (1.9 billion documents) on which 32 shared retrieval tasks are based. Using Docker images of 50 standard retrieval approaches, we automatically evaluated all approaches on all tasks (50 ⋅ 32 = 1,600 runs) in less than a week on a midsize cluster (1,620 cores and 24 GPUs). This instance of TIREx is open for submissions and will be integrated with the IR Anthology, as well as released open source.

TLDR

The Information Retrieval Experiment Platform (TIREx) integrates ir_datasets, ir_measures, and PyTerrier with TIRA to enable standardized, reproducible, scalable, and blinded retrieval experiments, hosting 32 shared retrieval tasks over 15 corpora on which 50 dockerized retrieval approaches have already been evaluated.

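The interfaces being standardized are easy to picture. Below is a minimal sketch of the PyTerrier / ir_datasets / ir_measures style of experiment that TIREx builds on; on the platform itself the approach would additionally be dockerized and submitted through TIRA, and the small vaswani test collection is used here purely as an example.

```python
import pyterrier as pt

if not pt.started():
    pt.init()

# ir_datasets integration: corpora, topics, and qrels by dataset identifier.
dataset = pt.get_dataset("irds:vaswani")
index_ref = pt.IterDictIndexer("./index").index(dataset.get_corpus_iter())

# A retrieval approach exposed through PyTerrier's transformer interface.
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25")

# Standardized evaluation over the dataset's topics and relevance judgments.
print(pt.Experiment([bm25], dataset.get_topics(), dataset.get_qrels(),
                    eval_metrics=["map", "ndcg_cut_10"]))
```
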
A Non-Factoid Question-Answering Taxonomy

  • Valeriia Bolotova, Vladislav Blinov, Falk Scholer, W. Bruce Croft, M. Sanderson

  • Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

  • July 6, 2022

Non-factoid question answering (NFQA) is a challenging and under-researched task that requires constructing long-form answers, such as explanations or opinions, to open-ended non-factoid questions (NFQs). There is still little understanding of the categories of NFQs that people tend to ask, what form of answers they expect to see in return, and what the key research challenges of each category are. This work presents the first comprehensive taxonomy of NFQ categories and the expected structure of answers. The taxonomy was constructed with a transparent methodology and extensively evaluated via crowdsourcing. The most challenging categories were identified through an editorial user study. We also release a dataset of categorised NFQs and a question category classifier. Finally, we conduct a quantitative analysis of the distribution of question categories using major NFQA datasets, showing that the NFQ categories that are the most challenging for current NFQA systems are poorly represented in these datasets. This imbalance may lead to insufficient system performance for challenging categories. The new taxonomy, along with the category classifier, will aid research in the area, helping to create more balanced benchmarks and to focus models on addressing specific categories.

TLDR

This work presents the first comprehensive taxonomy of NFQ categories and the expected structure of answers, constructed with a transparent methodology and extensively evaluated via crowdsourcing.

WWW

Simplistic Collection and Labeling Practices Limit the Utility of Benchmark Datasets for Twitter Bot Detection

  • C. Hays, Zachary Schutzman, Manish Raghavan, Erin Walk, Philipp Zimmer

  • Proceedings of the ACM Web Conference 2023

  • January 17, 2023

Accurate bot detection is necessary for the safety and integrity of online platforms. It is also crucial for research on the influence of bots in elections, the spread of misinformation, and financial market manipulation. Platforms deploy infrastructure to flag or remove automated accounts, but their tools and data are not publicly available. Thus, the public must rely on third-party bot detection. These tools employ machine learning and often achieve near-perfect performance for classification on existing datasets, suggesting bot detection is accurate, reliable and fit for use in downstream applications. We provide evidence that this is not the case and show that high performance is attributable to limitations in dataset collection and labeling rather than sophistication of the tools. Specifically, we show that simple decision rules — shallow decision trees trained on a small number of features — achieve near-state-of-the-art performance on most available datasets and that bot detection datasets, even when combined together, do not generalize well to out-of-sample datasets. Our findings reveal that predictions are highly dependent on each dataset’s collection and labeling procedures rather than fundamental differences between bots and humans. These results have important implications for both transparency in sampling and labeling procedures and potential biases in research using existing bot detection tools for pre-processing.

TLDR

It is shown that simple decision rules — shallow decision trees trained on a small number of features — achieve near-state-of-the-art performance on most available datasets and that bot detection datasets, even when combined together, do not generalize well to out-of-sample datasets.
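
The headline result is easy to reproduce in spirit: a depth-limited decision tree over a handful of account features is just a couple of threshold rules, yet on these benchmarks such rules approach state-of-the-art scores. The sketch below uses synthetic stand-in data; the feature names and toy labeling rule are hypothetical, not drawn from the paper's datasets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Stand-in data; real experiments would load a bot-detection benchmark instead.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "follower_count": rng.lognormal(5, 2, n).astype(int),
    "tweets_per_day": rng.exponential(5, n),
    "account_age_days": rng.integers(1, 4000, n),
})
df["is_bot"] = (df["tweets_per_day"] > 8).astype(int)  # toy labeling rule

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="is_bot"), df["is_bot"], test_size=0.2, random_state=0
)

# A depth-2 tree: at most three split thresholds in total.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1]))
```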

Rewiring What-to-Watch-Next Recommendations to Reduce Radicalization Pathways

  • Francesco Fabbri, Yanhao Wang, F. Bonchi, C. Castillo, M. Mathioudakis

  • Proceedings of the ACM Web Conference 2022

  • February 1, 2022

Recommender systems typically suggest to users content similar to what they consumed in the past. If a user happens to be exposed to strongly polarized content, she might subsequently receive recommendations which may steer her towards more and more radicalized content, eventually being trapped in what we call a “radicalization pathway”. In this paper, we study the problem of mitigating radicalization pathways using a graph-based approach. Specifically, we model the set of recommendations of a “what-to-watch-next” recommender as a d-regular directed graph where nodes correspond to content items, links to recommendations, and paths to possible user sessions. We measure the “segregation” score of a node representing radicalized content as the expected length of a random walk from that node to any node representing non-radicalized content. High segregation scores are associated with a larger chance of users getting trapped in radicalization pathways. Hence, we define the problem of reducing the prevalence of radicalization pathways by selecting a small number of edges to “rewire”, so as to minimize the maximum of segregation scores among all radicalized nodes, while maintaining the relevance of the recommendations. We prove that the problem of finding the optimal set of recommendations to rewire is NP-hard and NP-hard to approximate within any factor. Therefore, we turn our attention to heuristics, and propose an efficient yet effective greedy algorithm based on the absorbing random walk theory. Our experiments on real-world datasets in the context of video and news recommendations confirm the effectiveness of our proposal.

TLDR

This paper models the set of recommendations of a “what-to-watch-next” recommender as a d-regular directed graph where nodes correspond to content items, links to recommendations, and paths to possible user sessions, and proposes an efficient yet effective greedy algorithm based on the absorbing random walk theory.
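
The segregation score has a closed form from absorbing Markov chain theory: make every non-radicalized node absorbing, let $Q$ be the walk's transition matrix restricted to the radicalized (transient) nodes, and the expected number of steps to absorption is $t = (I - Q)^{-1}\mathbf{1}$. A minimal sketch of that computation (variable names ours), which is the quantity a greedy rewiring heuristic would repeatedly re-evaluate:

```python
import numpy as np

def segregation_scores(P, radicalized):
    """Expected random-walk length from each radicalized node to any
    non-radicalized node; P is the row-stochastic transition matrix of
    the what-to-watch-next recommendation graph."""
    idx = np.asarray(radicalized)
    Q = P[np.ix_(idx, idx)]  # transitions among radicalized nodes only
    # Solve (I - Q) t = 1 instead of forming the inverse explicitly.
    return np.linalg.solve(np.eye(len(idx)) - Q, np.ones(len(idx)))

# Toy 4-node example: nodes 0 and 1 are radicalized, 2 and 3 are not.
P = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0, 0.5],
              [0.0, 0.0, 0.5, 0.5],
              [0.5, 0.0, 0.0, 0.5]])
print(segregation_scores(P, [0, 1]))  # -> [2., 2.]
```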

Theory

FOCS
SODA

Dynamic Algorithms for Maximum Matching Size

  • Soheil Behnezhad

  • ACM-SIAM Symposium on Discrete Algorithms

  • July 15, 2022

We study fully dynamic algorithms for maximum matching. This is a well-studied problem, known to admit several update-time/approximation trade-offs. For instance, it is known how to maintain a 1/2-approximate matching in $\log^{O(1)} n$ update time or a $2/3$-approximate matching in $O(\sqrt{n})$ update time, where $n$ is the number of vertices. It has been a long-standing open problem to determine whether either of these bounds can be improved. In this paper, we show that when the goal is to maintain just the size of the matching (and not its edge-set), then these bounds can indeed be improved. First, we give an algorithm that takes $\log^{O(1)} n$ update-time and maintains a $.501$-approximation ($.585$-approximation if the graph is bipartite). Second, we give an algorithm that maintains a $(2/3 + \Omega(1))$-approximation in $O(\sqrt{n})$ time for bipartite graphs. Our results build on new connections to sublinear time algorithms. In particular, a key tool for both is an algorithm of the author for estimating the size of maximal matchings in $\widetilde{O}(n)$ time [Behnezhad; FOCS 2021]. Our second result also builds on the edge-degree constrained subgraph (EDCS) of Bernstein and Stein [ICALP'15, SODA'16]. In particular, while it has been known that EDCS may not include a better than 2/3-approximation, we give a new characterization of such tight instances which allows us to break it. We believe this characterization might be of independent interest.

TLDR

This paper gives an algorithm that takes $\log^{O(1)} n$ update time and maintains a $.501$-approximation of the maximum matching size, another that maintains a $(2/3 + \Omega(1))$-approximation in $O(\sqrt{n})$ time for bipartite graphs, and a new characterization of tight EDCS instances that allows the $2/3$ barrier to be broken.
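
The sublinear-time building block mentioned above can be sketched with the classical local simulation of random-greedy maximal matching: assign every edge a random rank; an edge is in the greedy matching iff none of its lower-ranked neighbors is, which a local recursion can decide without building the whole matching, and sampling edges against this oracle estimates the matching size. This is the textbook oracle, not the paper's sharper $\widetilde{O}(n)$-time algorithm.

```python
import random
from collections import defaultdict

def greedy_matching_oracle(edges, seed=0):
    """Oracle telling whether an edge lies in the maximal matching built by
    greedily scanning edges in increasing random-rank order."""
    rng = random.Random(seed)
    rank = {e: rng.random() for e in edges}
    incident = defaultdict(list)
    for u, v in edges:
        incident[u].append((u, v))
        incident[v].append((u, v))
    memo = {}

    def in_matching(e):
        if e not in memo:
            u, v = e
            lower = [f for f in incident[u] + incident[v]
                     if f != e and rank[f] < rank[e]]
            # e is matched iff every lower-ranked adjacent edge is unmatched;
            # ranks strictly decrease along the recursion, so it terminates.
            memo[e] = all(not in_matching(f) for f in sorted(lower, key=rank.get))
        return memo[e]

    return in_matching

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
oracle = greedy_matching_oracle(edges)
print(sum(oracle(e) for e in edges))  # size of the random-greedy matching
```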

New Diameter-Reducing Shortcuts and Directed Hopsets: Breaking the $\sqrt{n}$ Barrier

  • Shimon Kogan, Merav Parter

  • ACM-SIAM Symposium on Discrete Algorithms

  • November 25, 2021

For an n-vertex digraph G = (V, E), a shortcut set is a (small) subset of edges H taken from the transitive closure of G that, when added to G, guarantees that the diameter of G ∪ H is small. Shortcut sets, introduced by Thorup in 1993, have a wide range of applications in algorithm design, especially in the context of parallel, distributed and dynamic computation on directed graphs. A folklore result in this context shows that every n-vertex digraph admits a shortcut set of linear size (i.e., of $O(n)$ edges) that reduces the diameter to $O(\sqrt{n})$. Despite extensive research over the years, the question of whether one can reduce the diameter to $o(\sqrt{n})$ with $\widetilde{O}(n)$ shortcut edges has been left open. We provide the first improved diameter-sparsity tradeoff for this problem, breaking the $\sqrt{n}$ diameter barrier. Specifically, we show an $O(n^{\omega})$-time randomized algorithm for computing a linear shortcut set that reduces the diameter of the digraph to $\widetilde{O}(n^{1/3})$. This narrows the gap w.r.t. the current diameter lower bound of $\Omega(n^{1/6})$ by [Huang and Pettie, SWAT'18]. Moreover, we show that a diameter of $O(n^{1/2})$ can in fact be achieved with a sublinear number of $O(n^{3/4})$ shortcut edges. Formally, letting $S(n, D)$ be the bound on the size of the shortcut set required in order to reduce the diameter of any n-vertex digraph to at most $D$, our algorithms yield $S(n, D) = \widetilde{O}(n^2/D^3)$ for $D \leq n^{1/3}$, and $S(n, D) = \widetilde{O}((n/D)^{3/2})$ for $D > n^{1/3}$. We also extend our algorithms to provide improved $(\beta, \epsilon)$ hopsets for n-vertex weighted directed graphs.

TLDR

It is shown that a diameter of $O(n^{1/2})$ can in fact be achieved with a sublinear number of $O(n^{3/4})$ shortcut edges, and the first improved diameter-sparsity tradeoff for the problem is provided, breaking the $\sqrt{n}$ diameter barrier.

STOC

Doubly Efficient Private Information Retrieval and Fully Homomorphic RAM Computation from Ring LWE

  • Wei-Kai Lin, Ethan Mook, Daniel Wichs

  • Proceedings of the 55th Annual ACM Symposium on Theory of Computing

  • June 2, 2023

A (single server) private information retrieval (PIR) allows a client to read data from a public database held on a remote server, without revealing to the server which locations she is reading. In a doubly efficient PIR (DEPIR), the database is first preprocessed, but the server can subsequently answer any client’s query in time that is sub-linear in the database size. Prior work gave a plausible candidate for a public-key variant of DEPIR, where a trusted party is needed to securely preprocess the database and generate a corresponding public key for the clients; security relied on a new non-standard code-based assumption and a heuristic use of ideal obfuscation. In this work we construct the stronger unkeyed notion of DEPIR, where the preprocessing is a deterministic procedure that the server can execute on its own. Moreover, we prove security under just the standard ring learning-with-errors (RingLWE) assumption. For a database of size $N$ and any constant $\varepsilon > 0$, the preprocessing run-time and size is $O(N^{1+\varepsilon})$, while the run-time and communication-complexity of each PIR query is $\mathrm{polylog}(N)$. We also show how to update the preprocessed database in time $O(N^{\varepsilon})$. Our approach is to first construct a standard PIR where the server’s computation consists of evaluating a multivariate polynomial; we then convert it to a DEPIR by preprocessing the polynomial to allow for fast evaluation, using the techniques of Kedlaya and Umans (STOC ’08). Building on top of our DEPIR, we construct general fully homomorphic encryption for random-access machines (RAM-FHE), which allows a server to homomorphically evaluate an arbitrary RAM program P over a client’s encrypted input x and the server’s preprocessed plaintext input y to derive an encryption of the output P(x,y) in time that scales with the RAM run-time of the computation rather than its circuit size. Prior work only gave a heuristic candidate construction of a restricted notion of RAM-FHE. In this work, we construct RAM-FHE under the RingLWE assumption with circular security. For a RAM program P with worst-case run-time T, the homomorphic evaluation runs in time $T^{1+\varepsilon} \cdot (|x| + |y|)$.

TLDR

This work constructs the stronger unkeyed notion of DEPIR, where the preprocessing is a deterministic procedure that the server can execute on its own, and proves security under just the standard ring learning-with-errors (RingLWE) assumption.
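
The "server computation = polynomial evaluation" view can be made concrete with the multilinear extension, used here as a simplified stand-in for the paper's actual encoding over a ring: a database $\mathrm{DB} \in \{0,1\}^N$ with $N = 2^m$ is interpolated as

```latex
f_{\mathrm{DB}}(x_1,\ldots,x_m) \;=\; \sum_{i \in \{0,1\}^m} \mathrm{DB}[i]\, \prod_{j=1}^{m} \bigl( i_j x_j + (1 - i_j)(1 - x_j) \bigr),
\qquad f_{\mathrm{DB}}(i) = \mathrm{DB}[i] \ \ \text{for } i \in \{0,1\}^m .
```

A query then amounts to evaluating $f_{\mathrm{DB}}$ at a point that encodes the client's index under homomorphic encryption, and the Kedlaya–Umans preprocessing is what lets the server evaluate the preprocessed polynomial in time polylogarithmic in $N$.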

The Randomized 𝑘-Server Conjecture Is False!

  • Sébastien Bubeck, Christian Coester, Y. Rabani

  • Proceedings of the 55th Annual ACM Symposium on Theory of Computing

  • November 10, 2022

We prove a few new lower bounds on the randomized competitive ratio for the k-server problem and other related problems, resolving some long-standing conjectures. In particular, for metrical task systems (MTS) we asymptotically settle the competitive ratio and obtain the first improvement to an existential lower bound since the introduction of the model 35 years ago (in 1987). More concretely, we show: (1) There exist $(k+1)$-point metric spaces in which the randomized competitive ratio for the k-server problem is $\Omega(\log^2 k)$. This refutes the folklore conjecture (which is known to hold in some families of metrics) that in all metric spaces with at least $k+1$ points, the competitive ratio is $\Theta(\log k)$. (2) Consequently, there exist $n$-point metric spaces in which the randomized competitive ratio for MTS is $\Omega(\log^2 n)$. This matches the upper bound that holds for all metrics. The previously best existential lower bound was $\Omega(\log n)$ (which was known to be tight for some families of metrics). (3) For all $k < n \in \mathbb{N}$, for all $n$-point metric spaces the randomized k-server competitive ratio is at least $\Omega(\log k)$, and consequently the randomized MTS competitive ratio is at least $\Omega(\log n)$. These universal lower bounds are asymptotically tight. The previous bounds were $\Omega(\log k/\log\log k)$ and $\Omega(\log n/\log\log n)$, respectively. (4) The randomized competitive ratio for the w-set metrical service systems problem, and its equivalent width-w layered graph traversal problem, is $\Omega(w^2)$. This slightly improves the previous lower bound and matches the recently discovered upper bound. (5) Our results imply improved lower bounds for other problems like k-taxi, distributed paging, and metric allocation. These lower bounds share a common thread, and other than the third bound, also a common construction.

TLDR

For metrical task systems (MTS), the randomized competitive ratio is asymptotically settled, and the first improvement to an existential lower bound since the introduction of the model 35 years ago is obtained.

Locally testable codes with constant rate, distance, and locality

  • Irit Dinur, Shai Evra, R. Livne, A. Lubotzky, S. Mozes

  • Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing

  • November 8, 2021

A locally testable code (LTC) is an error correcting code that has a property-tester. The tester reads q bits that are randomly chosen, and rejects words with probability proportional to their distance from the code. The parameter q is called the locality of the tester. LTCs were initially studied as important components of probabilistically checkable proofs (PCP), and since then the topic has evolved on its own. High rate LTCs could be useful in practice: before attempting to decode a received word, one can save time by first quickly testing if it is close to the code. An outstanding open question has been whether there exist “$c^3$-LTCs”, namely LTCs with constant rate, constant distance, and constant locality. In this work we construct such codes based on a new two-dimensional complex which we call a left-right Cayley complex. This is essentially a graph which, in addition to vertices and edges, also has squares. Our codes can be viewed as a two-dimensional version of (the one-dimensional) expander codes, where the codewords are functions on the squares rather than on the edges.

TLDR

This work constructs LTCs with constant rate, constant distance, and constant locality based on a new two-dimensional complex which they call a left-right Cayley complex, which is essentially a graph which, in addition to vertices and edges, also has squares.
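
Schematically, following the paper's definition: for a group $G$ with two symmetric generating sets $A$ and $B$, the left-right Cayley complex has vertex set $G$, edges of the forms $\{g, ag\}$ and $\{g, gb\}$, and squares

```latex
\{\, g,\ ag,\ gb,\ agb \,\} \qquad \text{for } g \in G,\ a \in A,\ b \in B .
```

A codeword assigns a symbol to every square, subject to local constraints on the $A \times B$-indexed squares around each vertex; expansion of the underlying graphs is what drives the constant rate, distance, and locality.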

Latest News & Updates

Case Study: Iterative Design for Skimming Support

How might we help researchers quickly assess the relevance of scientific literature? Take a closer look at Skimming, Semantic Reader’s latest AI feature, and the collaborative design process behind it.

Behind the Scenes of Semantic Scholar’s New Author Influence Design

We released a new version of Author Influence interface to help scholars better discover other scholars in their fields. Here's how we identified user insights and made those design choices.

Artificial-intelligence search engines wrangle academic literature

Nature had a chat with Dan Weld, Chief Scientist at Semantic Scholar, to discuss how search engines are helping scientists explore and innovate by making it easier to draw connections from a massive collection of scientific literature.
