Don't Rank, Combine! Combining Machine Translation Hypotheses Using Quality Estimation
Neural machine translation systems estimate probabilities of target sentences given source sentences, yet these estimates may not align with human preferences. This work introduces QE-fusion, a method utilizing a quality estimation metric (QE) that better correlates with human judgments to synthesize improved translations. QE-fusion leverages a candidate pool sampled from a model, combining spans from different candidates using QE metrics such as CometKiwi. We compare QE-fusion against beam search and recent reranking techniques, such as Minimum Bayes Risk decoding or QE-reranking. Our method consistently improves translation quality in terms of COMET and BLEURT scores when applied to large language models (LLMs) used for translation (PolyLM, XGLM, Llama2, and Mistral) and to multilingual translation models (NLLB), over five language pairs. Notably, QE-fusion exhibits larger improvements for LLMs due to their ability to generate diverse outputs. We demonstrate that our approach generates novel translations in over half of the cases and consistently outperforms other methods across varying numbers of candidates (5-200). Furthermore, we empirically establish that QE-fusion scales linearly with the number of candidates in the pool. QE-fusion proves effective in enhancing LLM-based translation without the need for costly retraining of LLMs.
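The two reranking baselines named in the abstract above, QE-reranking and Minimum Bayes Risk (MBR) decoding, can be sketched as follows. This is a minimal illustration of the baselines, not of QE-fusion itself. The scoring functions `toy_qe` and `toy_utility` are hypothetical stand-ins for learned metrics such as CometKiwi or COMET, chosen only so that the example runs without any model.

```python
from typing import Callable, List


def qe_rerank(source: str, candidates: List[str],
              qe: Callable[[str, str], float]) -> str:
    """QE-reranking: return the candidate with the highest reference-free score."""
    return max(candidates, key=lambda cand: qe(source, cand))


def mbr_decode(candidates: List[str],
               utility: Callable[[str, str], float]) -> str:
    """MBR decoding: return the candidate with the highest average utility
    when every other candidate is treated as a pseudo-reference."""
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


# Hypothetical toy metrics (assumptions for illustration only).
def toy_qe(source: str, candidate: str) -> float:
    # Prefers candidates whose token count is close to the source's.
    return -abs(len(source.split()) - len(candidate.split()))


def toy_utility(hypothesis: str, reference: str) -> float:
    # Jaccard overlap between the two token sets.
    h, r = set(hypothesis.split()), set(reference.split())
    return len(h & r) / len(h | r)


source = "le chat s'est assis"
pool = ["the cat sat", "the cat sat down", "the cat"]
print(qe_rerank(source, pool, toy_qe))
print(mbr_decode(pool, toy_utility))
```

Both functions only select an existing candidate from the pool, whereas the abstract describes QE-fusion as combining spans from different candidates.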
Best-First Beam Search
Decoding for many NLP tasks requires an effective heuristic algorithm for approximating exact search since the problem of searching the full output space is often intractable, or impractical in many settings. The default algorithm for this job is beam search -- a pruned version of breadth-first search. Quite surprisingly, beam search often returns better results than exact inference due to beneficial search bias for NLP tasks. In this work, we show that the standard implementation of beam search can be made up to 10x faster in practice. Our method assumes that the scoring function is monotonic in the sequence length, which allows us to safely prune hypotheses that cannot be in the final set of hypotheses early on. We devise effective monotonic approximations to popular nonmonotonic scoring functions, including length normalization and mutual information decoding. Lastly, we propose a memory-reduced variant of Best-First Beam Search, which has a similar beneficial search bias in terms of downstream performance, but runs in a fraction of the time.
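For reference, standard beam search (the baseline described above, not the paper's best-first variant) can be sketched as below. The function `toy_model` is a hypothetical next-token distribution invented for this example, and the end-of-sequence marker `</s>` is likewise an assumption.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[Tuple[str, ...], float]  # (tokens, cumulative log-probability)


def beam_search(next_log_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
                beam_size: int, max_len: int, eos: str = "</s>") -> List[Hypothesis]:
    """Breadth-first search that keeps only the `beam_size` best prefixes per step."""
    beam: List[Hypothesis] = [((), 0.0)]
    for _ in range(max_len):
        candidates: List[Hypothesis] = []
        for prefix, score in beam:
            if prefix and prefix[-1] == eos:
                # Finished hypotheses are carried over unchanged.
                candidates.append((prefix, score))
                continue
            for token, log_p in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + log_p))
        # Prune: keep the highest-scoring hypotheses only.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_size]
        if all(p and p[-1] == eos for p, _ in beam):
            break
    return beam


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical model over the vocabulary {a, b, </s>}."""
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix[-1] == "a":
        return {"a": math.log(0.1), "</s>": math.log(0.9)}
    return {"b": math.log(0.3), "</s>": math.log(0.7)}


result = beam_search(toy_model, beam_size=2, max_len=5)
for tokens, score in result:
    print(tokens, round(math.exp(score), 2))
```

Because each step adds a log-probability, which is never positive, a hypothesis's score can only stay the same or decrease as it grows. A plain sum of log-probabilities is therefore one example of a scoring function that is monotonic in sequence length, the property the abstract relies on for early pruning.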
On Hallucination and Predictive Uncertainty in Conditional Language Generation
Despite improvements in performance on different natural language generation tasks, deep neural models are prone to hallucinating facts that are incorrect or nonexistent. Different hypotheses are proposed and examined separately for different tasks, but no systematic explanations are available across these tasks. In this study, we draw connections between hallucinations and predictive uncertainty in conditional language generation. We investigate their relationship in both image captioning and data-to-text generation and propose a simple extension to beam search to reduce hallucination. Our analysis shows that higher predictive uncertainty corresponds to a higher chance of hallucination. Epistemic uncertainty is more indicative of hallucination than aleatoric or total uncertainties. With the proposed beam search variant, it helps to achieve a better trade-off between performance on standard metrics and reduced hallucination.
Conditional Poisson Stochastic Beam Search
Beam search is the default decoding strategy for many sequence generation tasks in NLP. The set of approximate K-best items returned by the algorithm is a useful summary of the distribution for many applications; however, the candidates typically exhibit high overlap and may give a highly biased estimate for expectations under our model. These problems can be addressed by instead using stochastic decoding strategies. In this work, we propose a new method for turning beam search into a stochastic process: Conditional Poisson stochastic beam search (CPSBS). Rather than taking the maximizing set at each iteration, we sample K candidates without replacement according to the conditional Poisson sampling design. We view this as a more natural alternative to the stochastic beam search (SBS) of Kool et al. (2019). Furthermore, we show how samples generated under the CPSBS design can be used to build consistent estimators and sample diverse sets from sequence models. In our experiments, we observe CPSBS produces lower variance and more efficient estimators than SBS, even showing improvements in high entropy settings.
Toward Reliable Biomedical Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models
Large language models (LLMs) have shown significant potential in scientific disciplines such as biomedicine, particularly in hypothesis generation, where they can analyze vast literature, identify patterns, and suggest research directions. However, a key challenge lies in evaluating the truthfulness of generated hypotheses, as verifying their accuracy often requires substantial time and resources. Additionally, the hallucination problem in LLMs can lead to the generation of hypotheses that appear plausible but are ultimately incorrect, undermining their reliability. To facilitate the systematic study of these challenges, we introduce TruthHypo, a benchmark for assessing the capabilities of LLMs in generating truthful biomedical hypotheses, and KnowHD, a knowledge-based hallucination detector to evaluate how well hypotheses are grounded in existing knowledge. Our results show that LLMs struggle to generate truthful hypotheses. By analyzing hallucinations in reasoning steps, we demonstrate that the groundedness scores provided by KnowHD serve as an effective metric for filtering truthful hypotheses from the diverse outputs of LLMs. Human evaluations further validate the utility of KnowHD in identifying truthful hypotheses and accelerating scientific discovery. Our data and source code are available at https://github.com/Teddy-XiongGZ/TruthHypo.
Calculation of prompt diphoton production cross sections at Tevatron and LHC energies
A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to-leading logarithmic accuracy. The region of phase space is specified in which the calculation is most reliable. Good agreement is demonstrated with data from the Fermilab Tevatron, and predictions are made for more detailed tests with CDF and D0 data. Predictions are shown for distributions of diphoton pairs produced at the energy of the Large Hadron Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs boson are contrasted with those produced from QCD processes at the LHC, showing that enhanced sensitivity to the signal can be obtained with judicious selection of events.
Statistics of X-Ray Polarization Measurements
The polarization of an X-ray beam that produces electrons with velocity components perpendicular to the beam generates an azimuthal distribution of the ejected electrons. We present methods for simulating and for analyzing the angular dependence of electron detections which enable us to derive simple analytical expressions for useful statistical properties of observable data. The derivations are verified by simulations. While we confirm the results of previous work on this topic, we provide an extension needed for analytical treatment of the full range of possible polarization amplitudes.
Beam Tree Recursive Cells
We propose Beam Tree Recursive Cell (BT-Cell), a backpropagation-friendly framework to extend Recursive Neural Networks (RvNNs) with beam search for latent structure induction. We further extend this framework by proposing a relaxation of the hard top-k operators in beam search for better propagation of gradient signals. We evaluate our proposed models in different out-of-distribution splits in both synthetic and realistic data. Our experiments show that BT-Cell achieves near-perfect performance on several challenging structure-sensitive synthetic tasks like ListOps and logical inference while maintaining comparable performance in realistic data against other RvNN-based models. Additionally, we identify a previously unknown failure case for neural models in generalization to unseen numbers of arguments in ListOps. The code is available at: https://github.com/JRC1995/BeamTreeRecursiveCells.
Decomposition Enhances Reasoning via Self-Evaluation Guided Decoding
We endow Large Language Models (LLMs) with fine-grained self-evaluation to refine multi-step reasoning inference. We propose an effective prompting approach that integrates self-evaluation guidance through stochastic beam search. Our approach explores the reasoning search space using a well-calibrated automatic criterion. This enables an efficient search to produce higher-quality final predictions. With the self-evaluation guided stochastic beam search, we also balance the quality-diversity trade-off in the generation of reasoning chains. This allows our approach to adapt well with majority voting and surpass the corresponding Codex-backboned baselines by 6.34%, 9.56%, and 5.46% on the GSM8K, AQuA, and StrategyQA benchmarks, respectively, in few-shot accuracy. Analysis of our decompositional reasoning finds it pinpoints logic failures and leads to higher consistency and robustness. Our code is publicly available at https://github.com/YuxiXie/SelfEval-Guided-Decoding.
Deductive Beam Search: Decoding Deducible Rationale for Chain-of-Thought Reasoning
Recent advancements have significantly augmented the reasoning capabilities of Large Language Models (LLMs) through various methodologies, especially chain-of-thought (CoT) reasoning. However, previous methods fail to address reasoning errors in intermediate steps, leading to accumulative errors. In this paper, we propose Deductive Beam Search (DBS), which seamlessly integrates CoT and deductive reasoning with step-wise beam search for LLMs. Our approach deploys a verifier, verifying the deducibility of a reasoning step and its premises, thus alleviating the error accumulation. Furthermore, we introduce a scalable and labor-free data construction method to amplify our model's verification capabilities. Extensive experiments demonstrate that our approach significantly enhances the base performance of LLMs of various scales (7B, 13B, 70B, and ChatGPT) across 8 reasoning datasets from 3 diverse reasoning genres, including arithmetic, commonsense, and symbolic. Moreover, our analysis proves DBS's capability of detecting diverse and subtle reasoning errors and robustness on different model scales.
Sparse Autoencoders for Hypothesis Generation
We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., "mentions being surprised or shocked") using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines.
Interpreting Black Box Models via Hypothesis Testing
In science and medicine, model interpretations may be reported as discoveries of natural phenomena or used to guide patient treatments. In such high-stakes tasks, false discoveries may lead investigators astray. These applications would therefore benefit from control over the finite-sample error rate of interpretations. We reframe black box model interpretability as a multiple hypothesis testing problem. The task is to discover "important" features by testing whether the model prediction is significantly different from what would be expected if the features were replaced with uninformative counterfactuals. We propose two testing methods: one that provably controls the false discovery rate but which is not yet feasible for large-scale applications, and an approximate testing method which can be applied to real-world data sets. In simulation, both tests have high power relative to existing interpretability methods. When applied to state-of-the-art vision and language models, the framework selects features that intuitively explain model predictions. The resulting explanations have the additional advantage that they are themselves easy to interpret.
Sparks of Science: Hypothesis Generation Using Structured Paper Data
Generating novel and creative scientific hypotheses is a cornerstone in achieving Artificial General Intelligence. Large language and reasoning models have the potential to aid in the systematic creation, selection, and validation of scientifically informed hypotheses. However, current foundation models often struggle to produce scientific ideas that are both novel and feasible. One reason is the lack of a dedicated dataset that frames Scientific Hypothesis Generation (SHG) as a Natural Language Generation (NLG) task. In this paper, we introduce HypoGen, the first dataset of approximately 5500 structured problem-hypothesis pairs extracted from top-tier computer science conferences structured with a Bit-Flip-Spark schema, where the Bit is the conventional assumption, the Spark is the key insight or conceptual leap, and the Flip is the resulting counterproposal. HypoGen uniquely integrates an explicit Chain-of-Reasoning component that reflects the intellectual process from Bit to Flip. We demonstrate that framing hypothesis generation as conditional language modelling, with the model fine-tuned on Bit-Flip-Spark and the Chain-of-Reasoning (and where, at inference, we only provide the Bit), leads to improvements in the overall quality of the hypotheses. Our evaluation employs automated metrics and LLM judge rankings for overall quality assessment. We show that by fine-tuning on our HypoGen dataset we improve the novelty, feasibility, and overall quality of the generated hypotheses. The HypoGen dataset is publicly available at huggingface.co/datasets/UniverseTBD/hypogen-dr1.
Uncertain Evidence in Probabilistic Models and Stochastic Simulators
We consider the problem of performing Bayesian inference in probabilistic models where observations are accompanied by uncertainty, referred to as "uncertain evidence." We explore how to interpret uncertain evidence, and by extension the importance of proper interpretation as it pertains to inference about latent variables. We consider a recently-proposed method "distributional evidence" as well as revisit two older methods: Jeffrey's rule and virtual evidence. We devise guidelines on how to account for uncertain evidence and we provide new insights, particularly regarding consistency. To showcase the impact of different interpretations of the same uncertain evidence, we carry out experiments in which one interpretation is defined as "correct." We then compare inference results from each different interpretation illustrating the importance of careful consideration of uncertain evidence.
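The two older methods named above can give different posteriors for the same uncertain observation, as this small discrete sketch shows. It uses the textbook forms of Jeffrey's rule and virtual evidence, and every number in it (prior, likelihood table, evidence weights) is a made-up assumption for illustration, not a value from the paper.

```python
from typing import List


def posterior_given_y(prior: List[float], lik: List[List[float]], y: int) -> List[float]:
    """Ordinary Bayes rule p(x | y) for a hard observation y; lik[x][y] = p(y | x)."""
    joint = [prior[x] * lik[x][y] for x in range(len(prior))]
    z = sum(joint)
    return [j / z for j in joint]


def jeffrey_rule(prior: List[float], lik: List[List[float]], q: List[float]) -> List[float]:
    """Jeffrey's rule: mix the hard-evidence posteriors, p'(x) = sum_y q(y) p(x | y)."""
    out = [0.0] * len(prior)
    for y, q_y in enumerate(q):
        post = posterior_given_y(prior, lik, y)
        for x in range(len(prior)):
            out[x] += q_y * post[x]
    return out


def virtual_evidence(prior: List[float], lik: List[List[float]], lam: List[float]) -> List[float]:
    """Virtual evidence: treat lam(y) as a likelihood weight on y,
    p'(x) proportional to p(x) * sum_y p(y | x) lam(y)."""
    unnorm = [prior[x] * sum(lik[x][y] * lam[y] for y in range(len(lam)))
              for x in range(len(prior))]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Hypothetical binary model: latent x in {0, 1}, observation y in {0, 1}.
prior = [0.5, 0.5]
lik = [[0.9, 0.1],   # p(y | x = 0)
       [0.2, 0.8]]   # p(y | x = 1)
evidence = [0.3, 0.7]  # "70% sure that y = 1"

jeffrey = jeffrey_rule(prior, lik, evidence)
virtual = virtual_evidence(prior, lik, evidence)
print([round(p, 4) for p in jeffrey])
print([round(p, 4) for p in virtual])
```

Here the same weights `[0.3, 0.7]` are read once as a new marginal over y (Jeffrey's rule) and once as likelihood weights (virtual evidence), which is the kind of difference in interpretation the abstract is concerned with.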
MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback
Hypothesis ranking is a crucial component of automated scientific discovery, particularly in natural sciences where wet-lab experiments are costly and throughput-limited. Existing approaches focus on pre-experiment ranking, relying solely on a large language model's internal reasoning without incorporating empirical outcomes from experiments. We introduce the task of experiment-guided ranking, which aims to prioritize candidate hypotheses based on the results of previously tested ones. However, developing such strategies is challenging due to the impracticality of repeatedly conducting real experiments in natural science domains. To address this, we propose a simulator grounded in three domain-informed assumptions, modeling hypothesis performance as a function of similarity to a known ground truth hypothesis, perturbed by noise. We curate a dataset of 124 chemistry hypotheses with experimentally reported outcomes to validate the simulator. Building on this simulator, we develop a pseudo experiment-guided ranking method that clusters hypotheses by shared functional characteristics and prioritizes candidates based on insights derived from simulated experimental feedback. Experiments show that our method outperforms pre-experiment baselines and strong ablations.
Literature Meets Data: A Synergistic Approach to Hypothesis Generation
AI holds promise for transforming scientific processes, including hypothesis generation. Prior work on hypothesis generation can be broadly categorized into theory-driven and data-driven approaches. While both have proven effective in generating novel and plausible hypotheses, it remains an open question whether they can complement each other. To address this, we develop the first method that combines literature-based insights with data to perform LLM-powered hypothesis generation. We apply our method on five different datasets and demonstrate that integrating literature and data outperforms other baselines (8.97% over few-shot, 15.75% over literature-based alone, and 3.37% over data-driven alone). Additionally, we conduct the first human evaluation to assess the utility of LLM-generated hypotheses in assisting human decision-making on two challenging tasks: deception detection and AI generated content detection. Our results show that human accuracy improves significantly by 7.44% and 14.19% on these tasks, respectively. These findings suggest that integrating literature-based and data-driven approaches provides a comprehensive and nuanced framework for hypothesis generation and could open new avenues for scientific inquiry.
PathFinder: Guided Search over Multi-Step Reasoning Paths
With recent advancements in large language models, methods like chain-of-thought prompting to elicit reasoning chains have been shown to improve results on reasoning tasks. However, tasks that require multiple steps of reasoning still pose significant challenges to state-of-the-art models. Drawing inspiration from the beam search algorithm, we propose PathFinder, a tree-search-based reasoning path generation approach. It enhances diverse branching and multi-hop reasoning through the integration of dynamic decoding, enabled by varying sampling methods and parameters. Using constrained reasoning, PathFinder integrates novel quality constraints, pruning, and exploration methods to enhance the efficiency and the quality of generation. Moreover, it includes scoring and ranking features to improve candidate selection. Our approach outperforms competitive baselines on three complex arithmetic and commonsense reasoning tasks by 6% on average. Our model generalizes well to longer, unseen reasoning chains, reflecting similar complexities to beam search with large branching factors.
HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation
There is growing interest in hypothesis generation with large language models (LLMs). However, fundamental questions remain: what makes a good hypothesis, and how can we systematically evaluate methods for hypothesis generation? To address this, we introduce HypoBench, a novel benchmark designed to evaluate LLMs and hypothesis generation methods across multiple aspects, including practical utility, generalizability, and hypothesis discovery rate. HypoBench includes 7 real-world tasks and 5 synthetic tasks with 194 distinct datasets. We evaluate four state-of-the-art LLMs combined with six existing hypothesis-generation methods. Overall, our results suggest that existing methods are capable of discovering valid and novel patterns in the data. However, the results from synthetic datasets indicate that there is still significant room for improvement, as current hypothesis generation methods do not fully uncover all relevant or meaningful patterns. Specifically, in synthetic settings, as task difficulty increases, performance significantly drops, with best models and methods only recovering 38.8% of the ground-truth hypotheses. These findings highlight challenges in hypothesis generation and demonstrate that HypoBench serves as a valuable resource for improving AI systems designed to assist scientific discovery.
Observation of nuclear modification of energy-energy correlators inside jets in heavy ion collisions
Energy-energy correlators are constructed by averaging the number of charged particle pairs within jets, weighted by the product of their transverse momenta, as a function of the angular separation of the particles within a pair. They are sensitive to a multitude of perturbative and nonperturbative quantum chromodynamics phenomena in high-energy particle collisions. Using lead-lead data recorded with the CMS detector, energy-energy correlators inside high transverse momentum jets are measured in heavy ion collisions for the first time. The data are obtained at a nucleon-nucleon center-of-mass energy of 5.02 TeV and correspond to an integrated luminosity of 1.70 nb^{-1}. A similar analysis is done for proton-proton collisions at the same center-of-mass energy to establish a reference. The ratio of lead-lead to proton-proton energy-energy correlators reveals significant jet substructure modifications in the quark-gluon plasma. The results are compared to different models that incorporate either color coherence or medium response effects, where the two effects predict similar substructure modifications.
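The construction in the first sentence above (particle pairs within a jet, weighted by the product of their transverse momenta, binned in angular separation) can be sketched for a single jet as follows. This is an unnormalized toy version: the particle values and bin edges are invented, and the averaging over jets, normalization, and detector corrections used in the actual measurement are not modeled.

```python
import math
from itertools import combinations
from typing import List, Tuple

Particle = Tuple[float, float, float]  # (pT, eta, phi)


def delta_r(eta1: float, phi1: float, eta2: float, phi2: float) -> float:
    """Angular separation, with the azimuthal difference wrapped into [-pi, pi)."""
    dphi = (phi1 - phi2 + math.pi) % (2 * math.pi) - math.pi
    return math.hypot(eta1 - eta2, dphi)


def eec_histogram(particles: List[Particle], bin_edges: List[float]) -> List[float]:
    """Sum pT_i * pT_j over all distinct pairs, binned by their separation."""
    hist = [0.0] * (len(bin_edges) - 1)
    for (pt1, eta1, phi1), (pt2, eta2, phi2) in combinations(particles, 2):
        dr = delta_r(eta1, phi1, eta2, phi2)
        for k in range(len(hist)):
            if bin_edges[k] <= dr < bin_edges[k + 1]:
                hist[k] += pt1 * pt2
                break  # each pair falls into at most one bin
    return hist


# Hypothetical jet constituents and binning.
jet = [(10.0, 0.0, 0.0), (5.0, 0.1, 0.0), (2.0, 0.0, 0.3)]
edges = [0.0, 0.2, 0.4]
hist = eec_histogram(jet, edges)
print(hist)
```

Pairs whose separation falls outside the bin range are simply dropped in this sketch.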
IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery
The rapid advancement in capabilities of large language models (LLMs) raises a pivotal question: How can LLMs accelerate scientific discovery? This work tackles the crucial first stage of research, generating novel hypotheses. While recent work on automated hypothesis generation focuses on multi-agent frameworks and extending test-time compute, none of the approaches effectively incorporate transparency and steerability through a synergistic Human-in-the-loop (HITL) approach. To address this gap, we introduce IRIS: Interactive Research Ideation System, an open-source platform designed for researchers to leverage LLM-assisted scientific ideation. IRIS incorporates innovative features to enhance ideation, including adaptive test-time compute expansion via Monte Carlo Tree Search (MCTS), a fine-grained feedback mechanism, and query-based literature synthesis. It is designed to empower researchers with greater control and insight throughout the ideation process. We additionally conduct a user study with researchers across diverse disciplines, validating the effectiveness of our system in enhancing ideation. We open-source our code at https://github.com/Anikethh/IRIS-Interactive-Research-Ideation-System
Addendum to Research MMMCV; A Man/Microbio/Megabio/Computer Vision
In October 2007, a Research Proposal for the University of Sydney, Australia, the author suggested that biovie-physical phenomenon as `electrodynamic dependant biological vision', is governed by relativistic quantum laws and biovision. The phenomenon on the basis of `biovielectroluminescence', satisfies man/microbio/megabio/computer vision (MMMCV), as a robust candidate for physical and visual sciences. The general aim of this addendum is to present a refined text of Sections 1-3 of that proposal and highlighting the contents of its Appendix in form of a `Mechanisms' Section. We then briefly remind in an article aimed for December 2007, by appending two more equations into Section 3, a theoretical II-time scenario as a time model well-proposed for the phenomenon. The time model within the core of the proposal, plays a significant role in emphasizing the principle points on Objectives no. 1-8, Sub-hypothesis 3.1.2, mentioned in Article [arXiv:0710.0410]. It also expresses the time concept in terms of causing quantized energy f(|E|) of time |t|, emit in regard to shortening the probability of particle loci as predictable patterns of particle's un-occurred motion, a solution to Heisenberg's uncertainty principle (HUP) into a simplistic manner. We conclude that, practical frames via a time algorithm to this model, fixates such predictable patterns of motion of scenery bodies onto recordable observation points of a MMMCV system. It even suppresses/predicts superposition phenomena coming from a human subject and/or other bio-subjects for any decision making event, e.g., brainwave quantum patterns based on vision. Maintaining the existential probability of Riemann surfaces of II-time scenarios in the context of biovielectroluminescence, makes motion-prediction a possibility.
Faithful Reasoning Using Large Language Models
Although contemporary large language models (LMs) demonstrate impressive question-answering capabilities, their answers are typically the product of a single call to the model. This entails an unwelcome degree of opacity and compromises performance, especially on problems that are inherently multi-step. To address these limitations, we show how LMs can be made to perform faithful multi-step reasoning via a process whose causal structure mirrors the underlying logical structure of the problem. Our approach works by chaining together reasoning steps, where each step results from calls to two fine-tuned LMs, one for selection and one for inference, to produce a valid reasoning trace. Our method carries out a beam search through the space of reasoning traces to improve reasoning quality. We demonstrate the effectiveness of our model on multi-step logical deduction and scientific question-answering, showing that it outperforms baselines on final answer accuracy, and generates humanly interpretable reasoning traces whose validity can be checked by the user.
Model-agnostic search for the quasinormal modes of gravitational wave echoes
Post-merger gravitational wave echoes provide a unique opportunity to probe the near-horizon structure of astrophysical black holes, which may be modified due to non-perturbative quantum gravity phenomena. However, since the waveform is subject to large theoretical uncertainties, it is necessary to develop model-agnostic search methods for detecting echoes from observational data. A promising strategy is to identify the characteristic quasinormal modes (QNMs) associated with echoes, in frequency space, which complements existing searches of quasiperiodic pulses in time. In this study, we build upon our previous work targeting these modes by incorporating relative phase information to optimize the Bayesian search algorithm. Using a new phase-marginalized likelihood, the performance can be significantly improved for well-resolved QNMs. This enables an efficient model-agnostic search for QNMs of different shapes by using a simple search template. To demonstrate the robustness of the search algorithm, we construct four complementary benchmarks for the echo waveform that span a diverse range of different theoretical possibilities for the near-horizon structure. We then validate our Bayesian search algorithms by injecting the benchmark models into different realizations of Gaussian noise. Using two types of phase-marginalized likelihoods, we find that the search algorithm can efficiently detect the corresponding QNMs. Therefore, our search strategy provides a concrete Bayesian and model-agnostic approach to "quantum black hole seismology".
AlphaMath Almost Zero: Process Supervision without Process
Recent advancements in large language models (LLMs) have substantially enhanced their mathematical reasoning abilities. However, these models still struggle with complex problems that require multiple reasoning steps, frequently leading to logical or numerical errors. While numerical mistakes can be largely addressed by integrating a code interpreter, identifying logical errors within intermediate steps is more challenging. Moreover, manually annotating these steps for training is not only expensive but also labor-intensive, requiring the expertise of professional annotators. In our study, we introduce an innovative approach that bypasses the need for process annotations (from humans or GPTs) by utilizing the Monte Carlo Tree Search (MCTS) framework. This technique automatically generates both the process supervision and the step-level evaluation signals. Our method iteratively trains the policy and value models, leveraging the capabilities of a well-pretrained LLM to progressively enhance its mathematical reasoning skills. Furthermore, we propose an efficient inference strategy, step-level beam search, where the value model is crafted to assist the policy model (i.e., LLM) in navigating more effective reasoning paths, rather than solely relying on prior probabilities. The experimental results on both in-domain and out-of-domain datasets demonstrate that even without GPT-4 or human-annotated process supervision, our AlphaMath framework achieves comparable or superior results to previous state-of-the-art methods.
Cosmic Multipoles in Galaxy Surveys Part I: How Inferences Depend on Source Counts and Masks
We present a new approach to constructing and fitting dipoles and higher-order multipoles in synthetic galaxy samples over the sky. Within our Bayesian paradigm, we illustrate that this technique is robust to masked skies, allowing us to make credible inferences about the relative contributions of each multipole. We also show that dipoles can be recovered in surveys with small footprints, determining the source counts required for concrete estimation of the dipole parameters. This work is motivated by recent probes of the cosmic dipole in galaxy catalogues. Namely, the kinematic dipole of the Cosmic Microwave Background, as arising from the motion of our heliocentric frame at approximately 370 km s^{-1}, implies that an analogous dipole should be observed in the number counts of galaxies in flux-density-limited samples. Recent studies have reported a dipole aligning with the kinematic dipole but with an anomalously large amplitude. Accordingly, our new technique will be important as forthcoming galaxy surveys are made available and for revisiting previous data.
Early Warning Signals and the Prosecutor's Fallacy
Early warning signals have been proposed to forecast the possibility of a critical transition, such as the eutrophication of a lake, the collapse of a coral reef, or the end of a glacial period. Because such transitions often unfold on temporal and spatial scales that can be difficult to approach by experimental manipulation, research has often relied on historical observations as a source of natural experiments. Here we examine a critical difference between selecting systems for study based on the fact that we have observed a critical transition and those systems for which we wish to forecast the approach of a transition. This difference arises by conditionally selecting systems known to experience a transition of some sort and failing to account for the bias this introduces -- a statistical error often known as the Prosecutor's Fallacy. By analysing simulated systems that have experienced transitions purely by chance, we reveal an elevated rate of false positives in common warning signal statistics. We further demonstrate a model-based approach that is less subject to this bias than these more commonly used summary statistics. We note that experimental studies with replicates avoid this pitfall entirely.
Efficient Massive Black Hole Binary parameter estimation for LISA using Sequential Neural Likelihood
The inspiral, merger, and ringdown of Massive Black Hole Binaries (MBHBs) is one of the main sources of Gravitational Waves (GWs) for the future Laser Interferometer Space Antenna (LISA), an ESA-led mission in the implementation phase. It is expected that LISA will detect these systems throughout the entire observable universe. Robust and efficient data analysis algorithms are necessary to detect and estimate physical parameters for these systems. In this work, we explore the application of Sequential Neural Likelihood, a simulation-based inference algorithm, to detect and characterize MBHB GW signals in synthetic LISA data. We describe in detail the different elements of the method, their performance and possible alternatives that can be used to enhance the performance. Instead of sampling from the conventional likelihood function, which requires a forward simulation for each evaluation, this method constructs a surrogate likelihood that is ultimately described by a neural network trained from a dataset of simulations of the MBHB signals and noise. One important advantage of this method is that, given that the likelihood is independent of the priors, we can iteratively train models that target specific observations in a fraction of the time and computational cost that other traditional and machine learning-based strategies would require. Because of the iterative nature of the method, we are able to train models to obtain qualitatively similar posteriors with less than 2% of the simulator calls that Markov Chain Monte Carlo methods would require. We compare these posteriors with those obtained from Markov Chain Monte Carlo techniques and discuss the differences that appear, in particular in relation to the important role that data compression has in the modular implementation of the method that we present. We also discuss different strategies to improve the performance of the algorithms.
A Survey on Hypothesis Generation for Scientific Discovery in the Era of Large Language Models
Hypothesis generation is a fundamental step in scientific discovery, yet it is increasingly challenged by information overload and disciplinary fragmentation. Recent advances in Large Language Models (LLMs) have sparked growing interest in their potential to enhance and automate this process. This paper presents a comprehensive survey of hypothesis generation with LLMs by (i) reviewing existing methods, from simple prompting techniques to more complex frameworks, and proposing a taxonomy that categorizes these approaches; (ii) analyzing techniques for improving hypothesis quality, such as novelty boosting and structured reasoning; (iii) providing an overview of evaluation strategies; and (iv) discussing key challenges and future directions, including multimodal integration and human-AI collaboration. Our survey aims to serve as a reference for researchers exploring LLMs for hypothesis generation.
Reconstruction of inclined extensive air showers using radio signals: from arrival times and amplitudes to direction and energy
Radio detection is now an established technique for the study of ultra-high-energy (UHE) cosmic rays with energies above ~10^{17} eV. The next generation of radio experiments aims to extend this technique to the observation of UHE earth-skimming neutrinos, which requires the detection of very inclined extensive air showers (EAS). In this article we present a new reconstruction method for the arrival direction and the energy of EAS. It combines a point-source-like description of the radio wavefront with a phenomenological model: the Angular Distribution Function (ADF). The ADF describes the angular distribution of the radio signal amplitude in the 50-200 MHz frequency range, with a particular focus on the Cherenkov angle, a crucial feature of the radio amplitude pattern. The method is applicable to showers with zenith angles larger than 60°, and in principle up to neutrino-induced showers with up-going trajectories. It is tested here on a simulated data set of EAS induced by cosmic rays. A resolution better than 4 arc-minutes (0.07°) is achieved on arrival direction, as well as an intrinsic resolution of 5% on the electromagnetic energy, and around 15% on the primary energy.
Extension of the J-PARC Hadron Experimental Facility: Third White Paper
The J-PARC Hadron Experimental Facility was constructed with an aim to explore the origin and evolution of matter in the universe through the experiments with intense particle beams. In the past decade, many results on particle and nuclear physics have been obtained at the present facility. To expand the physics programs to unexplored regions never achieved, the extension project of the Hadron Experimental Facility has been extensively discussed. This white paper presents the physics of the extension of the Hadron Experimental Facility for resolving the issues in the fields of the strangeness nuclear physics, hadron physics, and flavor physics.
MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses
Scientific discovery contributes largely to human society's prosperity, and recent progress shows that LLMs could potentially catalyze this process. However, it is still unclear whether LLMs can discover novel and valid hypotheses in chemistry. In this work, we investigate this central research question: Can LLMs automatically discover novel and valid chemistry research hypotheses given only a chemistry research background (consisting of a research question and/or a background survey), without limitation on the domain of the research question? After extensive discussions with chemistry experts, we propose an assumption that a majority of chemistry hypotheses can result from a research background and several inspirations. With this key insight, we break the central question into three smaller fundamental questions. In brief, they are: (1) given a background question, whether LLMs can retrieve good inspirations; (2) with background and inspirations, whether LLMs can arrive at hypotheses; and (3) whether LLMs can identify good hypotheses to rank them higher. To investigate these questions, we construct a benchmark consisting of 51 chemistry papers published in Nature, Science, or a similar level in 2024 (all papers are only available online since 2024). Every paper is divided by chemistry PhD students into three components: background, inspirations, and hypothesis. The goal is to rediscover the hypothesis, given only the background and a large randomly selected chemistry literature corpus containing the ground truth inspiration papers, with LLMs trained with data up to 2023. We also develop an LLM-based multi-agent framework that leverages the assumption, consisting of three stages reflecting the three smaller questions. The proposed method can rediscover many hypotheses with very high similarity with the ground truth ones, covering the main innovations.
Maximising information from weak lensing galaxy surveys
Weak lensing galaxy surveys are currently undergoing a dramatic revolution as the dawn of the Stage-IV surveys is upon us. Hence, ensuring that our analysis methods are as accurate and precise as the raw data is of utmost importance. This motivated the development of a new implementation of the quadratic maximum likelihood power spectrum estimation technique, the application of the theoretical uncertainties approach to mitigate baryonic feedback biases, and a re-evaluation of the criterion from which binary scale cuts are derived when aiming to eliminate baryonic biases. These techniques maximise the available information from weak lensing observations while minimising potential systematic biases, showing how this PhD thesis contributes to the advancement of weak lensing cosmology.
Massive MIMO Beam Management in Sub-6 GHz 5G NR
Beam codebooks are a new feature of massive multiple-input multiple-output (M-MIMO) in 5G new radio (NR). Codebooks comprised of beamforming vectors are used to transmit reference signals and obtain limited channel state information (CSI) from receivers via the codeword index. This enables large arrays that cannot otherwise obtain sufficient CSI. The performance, however, is limited by the codebook design. In this paper, we show that machine learning can be used to train site-specific codebooks for initial access. We design a neural network based on an autoencoder architecture that uses a beamspace observation in combination with RF environment characteristics to improve the synchronization signal (SS) burst codebook. We test our algorithm using a flexible dataset of channels generated from QuaDRiGa. The results show that our model outperforms the industry standard (DFT beams) and approaches the optimal performance (perfect CSI and singular value decomposition (SVD)-based beamforming), using only a few bits of feedback.
Beyond Symmetries : Anomalies in Transverse Ward--Takahashi Identities
Anomalies in transverse Ward--Takahashi identities are studied, allowing discussion of the feasibility of anomalies arising in general non-symmetry Ward--Takahashi identities. We adopt Fujikawa's popular method and rigorous dimensional renormalization to verify the existence of transverse anomalies to one-loop order and any loop order, respectively. The arbitrariness of coefficients of transverse anomalies is revealed, and a way out is also proposed after relating transverse anomalies to Schwinger terms and comparing symmetry and non-symmetry anomalies. Papers that claim the non-existence of transverse anomalies are reviewed to find anomalies hidden in their approaches. The role played by transverse anomalies is discussed.
Detecting LHC Neutrinos at Surface Level
The first direct detection of neutrinos at the LHC not only marks the beginning of a novel collider neutrino program at CERN but also motivates considering additional neutrino detectors to fully exploit the associated physics potential. We investigate the feasibility and physics potential of neutrino experiments located at surface level. A topographic desk study was performed to identify all points at which the LHC's neutrino beams exit the earth. The closest location lies about 9 km east of the CMS interaction point, at the bottom of Lake Geneva. Several detectors to be placed at this location are considered, including a water Cherenkov detector and an emulsion detector. The detector concepts are introduced, and projections for their contribution to the LHC forward neutrino program and searches for dark sector particles are presented. However, the dilution of the neutrino flux over distance reduces the neutrino yield significantly, limiting the physics potential of surface-level detectors compared to ones closer to the interaction point, including the proposed FPF.
Fast kernel methods for Data Quality Monitoring as a goodness-of-fit test
We here propose a machine learning approach for monitoring particle detectors in real-time. The goal is to assess the compatibility of incoming experimental data with a reference dataset, characterising the data behaviour under normal circumstances, via a likelihood-ratio hypothesis test. The model is based on a modern implementation of kernel methods, nonparametric algorithms that can learn any continuous function given enough data. The resulting approach is efficient and agnostic to the type of anomaly that may be present in the data. Our study demonstrates the effectiveness of this strategy on multivariate data from drift tube chamber muon detectors.
If beam search is the answer, what was the question?
Quite surprisingly, exact maximum a posteriori (MAP) decoding of neural language generators frequently leads to low-quality results. Rather, most state-of-the-art results on language generation tasks are attained using beam search despite its overwhelmingly high search error rate. This implies that the MAP objective alone does not express the properties we desire in text, which merits the question: if beam search is the answer, what was the question? We frame beam search as the exact solution to a different decoding objective in order to gain insights into why high probability under a model alone may not indicate adequacy. We find that beam search enforces uniform information density in text, a property motivated by cognitive science. We suggest a set of decoding objectives that explicitly enforce this property and find that exact decoding with these objectives alleviates the problems encountered when decoding poorly calibrated language generation models. Additionally, we analyze the text produced using various decoding strategies and see that, in our neural machine translation experiments, the extent to which this property is adhered to strongly correlates with BLEU.
The Tracking Machine Learning challenge : Throughput phase
This paper reports on the second "Throughput" phase of the Tracking Machine Learning (TrackML) challenge on the Codalab platform. As in the first "Accuracy" phase, the participants had to solve a difficult experimental problem linked to accurately tracking the trajectories of particles such as those created at the Large Hadron Collider (LHC): given O(10^5) points, the participants had to connect them into O(10^4) individual groups that represent the particle trajectories, which are approximately helical. While in the first phase only the accuracy mattered, the goal of this second phase was a compromise between the accuracy and the speed of inference. Both were measured on the Codalab platform where the participants had to upload their software. The best three participants had solutions with good accuracy and speed an order of magnitude faster than the state of the art when the challenge was designed. Although the core algorithms were less diverse than in the first phase, a variety of techniques were used and are described in this paper. The performance of the algorithms is analysed in depth and lessons are derived.
Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation
There is growing excitement about the potential of Language Models (LMs) to accelerate scientific discovery. Falsifying hypotheses is key to scientific progress, as it allows claims to be iteratively refined over time. This process requires significant researcher effort, reasoning, and ingenuity. Yet current benchmarks for LMs predominantly assess their ability to generate solutions rather than challenge them. We advocate for developing benchmarks that evaluate this inverse capability - creating counterexamples for subtly incorrect solutions. To demonstrate this approach, we start with the domain of algorithmic problem solving, where counterexamples can be evaluated automatically using code execution. Specifically, we introduce REFUTE, a dynamically updating benchmark that includes recent problems and incorrect submissions from programming competitions, where human experts successfully identified counterexamples. Our analysis finds that the best reasoning agents, even OpenAI o3-mini (high) with code execution feedback, can create counterexamples for only <9% of incorrect solutions in REFUTE, even though ratings indicate its ability to solve up to 48% of these problems from scratch. We hope our work spurs progress in evaluating and enhancing LMs' ability to falsify incorrect solutions - a capability that is crucial for both accelerating research and making models self-improve through reliable reflective reasoning.
LLM-based Automated Theorem Proving Hinges on Scalable Synthetic Data Generation
Recent advancements in large language models (LLMs) have sparked considerable interest in automated theorem proving and a prominent line of research integrates stepwise LLM-based provers into tree search. In this paper, we introduce a novel proof-state exploration approach for training data synthesis, designed to produce diverse tactics across a wide range of intermediate proof states, thereby facilitating effective one-shot fine-tuning of LLM as the policy model. We also propose an adaptive beam size strategy, which effectively takes advantage of our data synthesis method and achieves a trade-off between exploration and exploitation during tree search. Evaluations on the MiniF2F and ProofNet benchmarks demonstrate that our method outperforms strong baselines under the stringent Pass@1 metric, attaining an average pass rate of 60.74% on MiniF2F and 21.18% on ProofNet. These results underscore the impact of large-scale synthetic data in advancing automated theorem proving.
Neutron capture measurements for s-process nucleosynthesis; A review about CERN n_TOF developments and contributions
This article presents a review of the main CERN n_TOF contributions to the field of neutron-capture experiments of interest for s-process nucleosynthesis studies over the last 25 years, with special focus on the measurement of radioactive isotopes. A few recent capture experiments on stable isotopes of astrophysical interest are also discussed. Results on s-process branching nuclei are appropriate to illustrate how advances in detection systems and upgrades in the facility have enabled increasingly challenging experiments and, as a consequence, have led to a better understanding and modeling of the s-process mechanism of nucleosynthesis. New endeavors combining radioactive-ion beams from ISOLDE for the production of radioisotopically pure samples for activation experiments at the new NEAR facility at n_TOF are briefly discussed. On the basis of these new exciting results, current limitations of state-of-the-art TOF and activation techniques are also described, thereby showing the pressing need for further upgrades and enhancements on both facilities and detection systems. A brief account of the potential technique based on inverse kinematics for direct neutron-capture measurements is also presented.
Large Language Models for Automated Open-domain Scientific Hypotheses Discovery
Hypothetical induction is recognized as the main reasoning type when scientists make observations about the world and try to propose hypotheses to explain those observations. Past research on hypothetical induction is under a constrained setting: (1) the observation annotations in the dataset are carefully handpicked sentences (resulting in a closed-domain setting); and (2) the ground truth hypotheses are mostly commonsense knowledge, making the task less challenging. In this work, we tackle these problems by proposing the first dataset for social science academic hypotheses discovery, with the final goal to create systems that automatically generate valid, novel, and helpful scientific hypotheses, given only a pile of raw web corpus. Unlike previous settings, the new dataset requires (1) using open-domain data (raw web corpus) as observations; and (2) proposing hypotheses even new to humanity. A multi-module framework is developed for the task, including three different feedback mechanisms to boost performance, which exhibits superior performance in terms of both GPT-4 based and expert-based evaluation. To the best of our knowledge, this is the first work showing that LLMs are able to generate novel ("not existing in literature") and valid ("reflecting reality") scientific hypotheses.
Hypothesis Generation for Materials Discovery and Design Using Goal-Driven and Constraint-Guided LLM Agents
Materials discovery and design are essential for advancing technology across various industries by enabling the development of application-specific materials. Recent research has leveraged Large Language Models (LLMs) to accelerate this process. We explore the potential of LLMs to generate viable hypotheses that, once validated, can expedite materials discovery. Collaborating with materials science experts, we curated a novel dataset from recent journal publications, featuring real-world goals, constraints, and methods for designing real-world applications. Using this dataset, we test LLM-based agents that generate hypotheses for achieving given goals under specific constraints. To assess the relevance and quality of these hypotheses, we propose a novel scalable evaluation metric that emulates the process a materials scientist would use to evaluate a hypothesis critically. Our curated dataset, proposed method, and evaluation framework aim to advance future research in accelerating materials discovery and design with LLMs.
Enhancing the significance of astrophysical events with multimessenger coincidences
Coincident multimessenger observations of cosmic sources can offer numerous benefits, especially when used in the context of synergistic astrophysics. One significant advantage is enhancing the detection significance of separate detectors by correlating their data and assuming joint emission. We have formulated an approach for updating the Bayesian posterior probability of an astrophysical origin, namely p_{\rm astro}, relying on multimessenger coincidences assuming an emission model. The description is applicable to any combination of messengers. We demonstrated the formalism for the gravitational waves and high-energy neutrinos case. Applying our method to the public data of candidate coincident high-energy neutrinos with subthreshold gravitational-wave triggers, we found that in the case of highly energetic neutrino coincidences, p_{\rm astro} can increase from approximately 0.1 to approximately 0.9. The amount of improvement depends on the assumed joint emission model. If models are trusted, the marked improvement makes subthreshold detections much more confident. Moreover, the model dependency can also be used to test the consistency of different models. This work is a crucial step toward the goal of uniting all detectors on equal footing into a statistically integrated, Earth-sized observatory for comprehensive multimessenger astrophysics.
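To give a feel for the size of the quoted change in the probability of astrophysical origin, the sketch below applies the generic odds form of Bayes' rule. It is not the formalism of the paper: the function names and the single `bayes_factor` summarising the coincidence evidence are assumptions made purely for illustration.

```python
def update_probability(p_prior: float, bayes_factor: float) -> float:
    """Posterior probability from a prior probability and a Bayes factor.

    Generic odds-form Bayes update (an illustrative assumption, not the
    paper's emission-model-based calculation).
    """
    prior_odds = p_prior / (1.0 - p_prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


def required_bayes_factor(p_prior: float, p_posterior: float) -> float:
    """Bayes factor needed to move p_prior to p_posterior."""
    prior_odds = p_prior / (1.0 - p_prior)
    posterior_odds = p_posterior / (1.0 - p_posterior)
    return posterior_odds / prior_odds


if __name__ == "__main__":
    # Evidence needed to go from a probability of 0.1 to 0.9.
    print(required_bayes_factor(0.1, 0.9))
```

Under this simplified picture, moving from 0.1 to 0.9 corresponds to multiplying the odds by a factor of about 81, which illustrates why the assumed joint emission model matters so much.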
BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery
Understanding the world and explaining it with scientific theories is a central aspiration of artificial intelligence research. Proposing theories, designing experiments to test them, and then revising them based on data are fundamental to scientific discovery. Despite the significant promise of LLM-based scientific agents, no benchmarks systematically test LLM's ability to propose scientific models, collect experimental data, and revise them in light of new data. We introduce BoxingGym, a benchmark with 10 environments for systematically evaluating both experimental design (e.g. collecting data to test a scientific theory) and model discovery (e.g. proposing and revising scientific theories). To enable tractable and quantitative evaluation, we implement each environment as a generative probabilistic model with which a scientific agent can run interactive experiments. These probabilistic models are drawn from various real-world scientific domains ranging from psychology to ecology. To quantitatively evaluate a scientific agent's ability to collect informative experimental data, we compute the expected information gain (EIG), an information-theoretic quantity which measures how much an experiment reduces uncertainty about the parameters of a generative model. A good scientific theory is a concise and predictive explanation. Therefore, to quantitatively evaluate model discovery, we ask a scientific agent to explain their model and then assess whether this explanation enables another scientific agent to make reliable predictions about this environment. In addition to this explanation-based evaluation, we compute standard model evaluation metrics such as prediction errors. We find that current LLMs, such as GPT-4o, struggle with both experimental design and model discovery. We find that augmenting the LLM-based agent with an explicit statistical model does not reliably improve these results.
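For reference, the expected information gain mentioned above is commonly written as the expected reduction in entropy of the parameter distribution. The notation below is assumed for illustration and is not taken from the benchmark: d is an experimental design, y its outcome, \theta the parameters of the generative model, and H the Shannon entropy.

```latex
\mathrm{EIG}(d) = \mathbb{E}_{p(y \mid d)}\left[ H\big[p(\theta)\big] - H\big[p(\theta \mid y, d)\big] \right]
```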
Pulsed Schlieren Imaging of Ultrasonic Haptics and Levitation using Phased Arrays
Ultrasonic acoustic fields have recently been used to generate haptic effects on the human skin as well as to levitate small sub-wavelength size particles. Schlieren imaging and background-oriented schlieren techniques can be used for acoustic wave pattern and beam shape visualization. These techniques exploit variations in the refractive index of a propagation medium by applying refractive optics or cross-correlation algorithms of photographs of illuminated background patterns. Here both background-oriented and traditional schlieren systems are used to visualize the regions of the acoustic power involved in creating dynamic haptic sensations and dynamic levitation traps. We demonstrate for the first time the application of background-oriented schlieren for imaging ultrasonic fields in air. We detail our imaging apparatus and present improved algorithms used to visualize these phenomena that we have produced using multiple phased arrays. Moreover, to improve imaging, we leverage an electronically controlled, high-output LED which is pulsed in synchrony with the ultrasonic carrier frequency.
A Heavy-Metal Scenario of Ultra-High-Energy Cosmic Rays
The mass composition of ultra-high-energy cosmic rays is an open problem in astroparticle physics. It is usually inferred from the depth of the shower maximum (Xmax) of cosmic-ray showers, which is only ambiguously determined by modern hadronic interaction models. We examine a data-driven scenario, in which we consider the expectation value of Xmax as a free parameter. We test the novel hypothesis whether the cosmic-ray data from the Pierre Auger Observatory can be interpreted in a consistent picture, under the assumption that the mass composition of cosmic rays at the highest energies is dominated by high metallicity, resulting in pure iron nuclei at energies above ~40 EeV. We investigate the implications on astrophysical observations and hadronic interactions, and we discuss the global consistency of the data assuming this heavy-metal scenario. We conclude that the data from the Pierre Auger Observatory can be interpreted consistently if the expectation values for Xmax from modern hadronic interaction models are shifted to larger values.
Beam Enumeration: Probabilistic Explainability For Sample Efficient Self-conditioned Molecular Design
Generative molecular design has moved from proof-of-concept to real-world applicability, as marked by the surge in very recent papers reporting experimental validation. Key challenges in explainability and sample efficiency present opportunities to enhance generative design to directly optimize expensive high-fidelity oracles and provide actionable insights to domain experts. Here, we propose Beam Enumeration to exhaustively enumerate the most probable sub-sequences from language-based molecular generative models and show that molecular substructures can be extracted. When coupled with reinforcement learning, extracted substructures become meaningful, providing a source of explainability and improving sample efficiency through self-conditioned generation. Beam Enumeration is generally applicable to any language-based molecular generative model and notably further improves the performance of the recently reported Augmented Memory algorithm, which achieved the new state-of-the-art on the Practical Molecular Optimization benchmark for sample efficiency. The combined algorithm generates more high-reward molecules, and generates them faster, given a fixed oracle budget. Beam Enumeration shows that improvements to explainability and sample efficiency for molecular design can be made synergistic.
Beam Decoding with Controlled Patience
Text generation with beam search has proven successful in a wide range of applications. The commonly-used implementation of beam decoding follows a first come, first served heuristic: it keeps a set of already completed sequences over time steps and stops when the size of this set reaches the beam size. We introduce a patience factor, a simple modification to this decoding algorithm, that generalizes the stopping criterion and provides flexibility to the depth of search. Extensive empirical results demonstrate that the patience factor improves decoding performance of strong pretrained models on news text summarization and machine translation over diverse language pairs, with a negligible inference slowdown. Our approach only modifies one line of code and can be thus readily incorporated in any implementation.
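The stopping rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `should_stop`, the parameter `patience`, and the rule "stop once the number of finished sequences reaches `patience * beam_size`" are assumptions about how the criterion is generalized, chosen so that `patience = 1.0` reproduces the first come, first served behaviour.

```python
def should_stop(num_finished: int, beam_size: int, patience: float = 1.0) -> bool:
    """Decide whether beam decoding should terminate.

    Assumed rule: stop once the number of finished hypotheses reaches
    patience * beam_size. With patience = 1.0 this is the first come,
    first served criterion; larger values let the search run deeper.
    """
    return num_finished >= patience * beam_size


if __name__ == "__main__":
    beam_size = 5
    # Default criterion: stop as soon as beam_size hypotheses are finished.
    print(should_stop(5, beam_size))
    # With more patience, the same state keeps searching.
    print(should_stop(5, beam_size, patience=2.0))
```

Because only the comparison changes, a modification of this kind fits the abstract's description of a one-line change to an existing decoder.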
The Virtual Quantum Optics Laboratory
We present a web-based software tool, the Virtual Quantum Optics Laboratory (VQOL), that may be used for designing and executing realistic simulations of quantum optics experiments. A graphical user interface allows one to rapidly build and configure a variety of different optical experiments, while the runtime environment provides unique capabilities for visualization and analysis. All standard linear optical components are available as well as sources of thermal, coherent, and entangled Gaussian states. A unique aspect of VQOL is the introduction of non-Gaussian measurements using detectors modeled as deterministic devices that "click" when the amplitude of the light falls above a given threshold. We describe the underlying theoretical models and provide several illustrative examples. We find that VQOL provides a faithful representation of many experimental quantum optics phenomena and may serve as both a useful instructional tool for students and a valuable research tool for practitioners.
The Linear Representation Hypothesis and the Geometry of Large Language Models
Informally, the 'linear representation hypothesis' is the idea that high-level concepts are represented linearly as directions in some representation space. In this paper, we address two closely related questions: What does "linear representation" actually mean? And, how do we make sense of geometric notions (e.g., cosine similarity or projection) in the representation space? To answer these, we use the language of counterfactuals to give two formalizations of "linear representation", one in the output (word) representation space, and one in the input (sentence) space. We then prove these connect to linear probing and model steering, respectively. To make sense of geometric notions, we use the formalization to identify a particular (non-Euclidean) inner product that respects language structure in a sense we make precise. Using this causal inner product, we show how to unify all notions of linear representation. In particular, this allows the construction of probes and steering vectors using counterfactual pairs. Experiments with LLaMA-2 demonstrate the existence of linear representations of concepts, the connection to interpretation and control, and the fundamental role of the choice of inner product.
Preserving Statistical Validity in Adaptive Data Analysis
A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of sophisticated validation techniques, to deep statistical methods for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of data analysis: the theory of statistical inference assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas in practice data is shared and reused with hypotheses and new analyses being generated on the basis of data exploration and the outcomes of previous analyses. In this work we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. As an instance of this problem, we propose and investigate the question of estimating the expectations of m adaptively chosen functions on an unknown distribution given n random samples. We show that, surprisingly, there is a way to estimate an exponential in n number of expectations accurately even if the functions are chosen adaptively. This gives an exponential improvement over standard empirical estimators that are limited to a linear number of estimates. Our result follows from a general technique that counter-intuitively involves actively perturbing and coordinating the estimates, using techniques developed for privacy preservation. We give additional applications of this technique to our question.
Diquark Correlations in Hadron Physics: Origin, Impact and Evidence
The last decade has seen a marked shift in how the internal structure of hadrons is understood. Modern experimental facilities, new theoretical techniques for the continuum bound-state problem and progress with lattice-regularised QCD have provided strong indications that soft quark+quark (diquark) correlations play a crucial role in hadron physics. For example, theory indicates that the appearance of such correlations is a necessary consequence of dynamical chiral symmetry breaking, viz. a corollary of emergent hadronic mass that is responsible for almost all visible mass in the universe; experiment has uncovered signals for such correlations in the flavour-separation of the proton's electromagnetic form factors; and phenomenology suggests that diquark correlations might be critical to the formation of exotic tetra- and penta-quark hadrons. A broad spectrum of such information is evaluated herein, with a view to consolidating the facts and therefrom moving toward a coherent, unified picture of hadron structure and the role that diquark correlations might play.
Lexically Constrained Decoding for Sequence Generation Using Grid Beam Search
We present Grid Beam Search (GBS), an algorithm which extends beam search to allow the inclusion of pre-specified lexical constraints. The algorithm can be used with any model that generates a sequence $\hat{y} = \{y_{0} \ldots y_{T}\}$ by maximizing $p(y \mid x) = \prod\limits_{t} p(y_{t} \mid x; \{y_{0} \ldots y_{t-1}\})$. Lexical constraints take the form of phrases or words that must be present in the output sequence. This is a very general way to incorporate additional knowledge into a model's output without requiring any modification of the model parameters or training data. We demonstrate the feasibility and flexibility of Lexically Constrained Decoding by conducting experiments on Neural Interactive-Predictive Translation, as well as Domain Adaptation for Neural Machine Translation. Experiments show that GBS can provide large improvements in translation quality in interactive scenarios, and that, even without any user input, GBS can be used to achieve significant gains in performance in domain adaptation scenarios.
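As a point of reference for the factorization above, the following is a minimal sketch of ordinary beam search, not Grid Beam Search itself: it enforces no lexical constraints. The `toy_model` function, its probabilities, and the `</s>` end-of-sequence token are hypothetical stand-ins for a real model's next-token distribution.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[Tuple[str, ...], float]  # (tokens, log-probability)
NextTokenFn = Callable[[Tuple[str, ...]], Dict[str, float]]


def beam_search(next_probs: NextTokenFn, beam_size: int, max_len: int,
                eos: str = "</s>") -> List[Hypothesis]:
    """Plain beam search: approximately maximize sum_t log p(y_t | prefix)."""
    beam: List[Hypothesis] = [((), 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Extend every open hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for prefix, score in beam:
            for tok, p in next_probs(prefix).items():
                if p > 0.0:
                    candidates.append((prefix + (tok,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Keep the top beam_size; completed hypotheses leave the beam.
        beam = []
        for hyp, score in candidates[:beam_size]:
            (finished if hyp[-1] == eos else beam).append((hyp, score))
        if not beam:
            break
    ranked = sorted(finished + beam, key=lambda c: c[1], reverse=True)
    return ranked[:beam_size]


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical next-token distribution used only for illustration."""
    if not prefix:
        return {"a": 0.6, "b": 0.4}
    if prefix[-1] == "a":
        return {"a": 0.1, "</s>": 0.9}
    return {"b": 0.5, "</s>": 0.5}


best_tokens, best_logprob = beam_search(toy_model, beam_size=2, max_len=3)[0]
```

A constrained decoder in the spirit of GBS would additionally track which constraints each hypothesis has already satisfied; that bookkeeping is omitted here.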
Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models
Neural sequence models are widely used to model time-series data. Equally ubiquitous is the usage of beam search (BS) as an approximate inference algorithm to decode output sequences from these models. BS explores the search space in a greedy left-right fashion retaining only the top-B candidates - resulting in sequences that differ only slightly from each other. Producing lists of nearly identical sequences is not only computationally wasteful but also typically fails to capture the inherent ambiguity of complex AI tasks. To overcome this problem, we propose Diverse Beam Search (DBS), an alternative to BS that decodes a list of diverse outputs by optimizing for a diversity-augmented objective. We observe that our method finds better top-1 solutions by controlling for the exploration and exploitation of the search space - implying that DBS is a better search algorithm. Moreover, these gains are achieved with minimal computational or memory overhead as compared to beam search. To demonstrate the broad applicability of our method, we present results on image captioning, machine translation and visual question generation using both standard quantitative metrics and qualitative human studies. Further, we study the role of diversity for image-grounded language generation tasks as the complexity of the image changes. We observe that our method consistently outperforms BS and previously proposed techniques for diverse decoding from neural sequence models.
Latent Beam Diffusion Models for Decoding Image Sequences
While diffusion models excel at generating high-quality images from text prompts, they struggle with visual consistency in image sequences. Existing methods generate each image independently, leading to disjointed narratives - a challenge further exacerbated in non-linear storytelling, where scenes must connect beyond adjacent frames. We introduce a novel beam search strategy for latent space exploration, enabling conditional generation of full image sequences with beam search decoding. Unlike prior approaches that use fixed latent priors, our method dynamically searches for an optimal sequence of latent representations, ensuring coherent visual transitions. To address beam search's quadratic complexity, we integrate a cross-attention mechanism that efficiently scores search paths and enables pruning, prioritizing alignment with both textual prompts and visual context. Human evaluations confirm that our approach outperforms baseline methods, producing full sequences with superior coherence, visual continuity, and textual alignment. By bridging advances in search optimization and latent space refinement, this work sets a new standard for structured image sequence generation.
BeamAggR: Beam Aggregation Reasoning over Multi-source Knowledge for Multi-hop Question Answering
Large language models (LLMs) have demonstrated strong reasoning capabilities. Nevertheless, they still suffer from factual errors when tackling knowledge-intensive tasks. Retrieval-augmented reasoning represents a promising approach. However, significant challenges still persist, including inaccurate and insufficient retrieval for complex questions, as well as difficulty in integrating multi-source knowledge. To address this, we propose Beam Aggregation Reasoning, BeamAggR, a reasoning framework for knowledge-intensive multi-hop QA. BeamAggR explores and prioritizes promising answers at each hop of the question. Concretely, we parse the complex questions into trees, which include atomic and composite questions, followed by bottom-up reasoning. For atomic questions, the LLM conducts reasoning on multi-source knowledge to get answer candidates. For composite questions, the LLM combines beam candidates, explores multiple reasoning paths through probabilistic aggregation, and prioritizes the most promising trajectory. Extensive experiments on four open-domain multi-hop reasoning datasets show that our method significantly outperforms SOTA methods by 8.5%. Furthermore, our analysis reveals that BeamAggR elicits better knowledge collaboration and answer aggregation.
Directed Beam Search: Plug-and-Play Lexically Constrained Language Generation
Large pre-trained language models are capable of generating realistic text. However, controlling these models so that the generated text satisfies lexical constraints, i.e., contains specific words, is a challenging problem. Given that state-of-the-art language models are too large to be trained from scratch in a manageable time, it is desirable to control these models without re-training them. Methods capable of doing this are called plug-and-play. Recent plug-and-play methods have been successful in constraining small bidirectional language models as well as forward models in tasks with a restricted search space, e.g., machine translation. However, controlling large transformer-based models to meet lexical constraints without re-training them remains a challenge. In this work, we propose Directed Beam Search (DBS), a plug-and-play method for lexically constrained language generation. Our method can be applied to any language model, is easy to implement and can be used for general language generation. In our experiments we use DBS to control GPT-2. We demonstrate its performance on keyword-to-phrase generation and we obtain results comparable to those of a state-of-the-art non-plug-and-play model for lexically constrained story generation.
The Quest for the Origins of Ultra-High-Energy Cosmic Rays
Significant progress has been made over the past decades towards unveiling the sources of the most energetic particles in nature, the ultra-high-energy cosmic rays (UHECRs). Despite these advancements, the exact astrophysical sites capable of accelerating these particles to such extreme energies remain largely unknown. Moreover, the mechanisms by which they achieve these extreme energies are poorly understood. Here, I provide a concise overview of the theory underlying the acceleration and propagation of UHECRs. I then critically discuss three recent results that could help unveil their origins: the reported excess around Centaurus A, the correlation with starburst galaxies, and the efforts to jointly model the energy spectrum, composition, and arrival directions. Finally, I discuss strategies for advancing this field, emphasising the need for refined theoretical models, the challenges in building them, and the potential for new observatories to shed light on the mysteries of UHECRs.
How well do SOTA legal reasoning models support abductive reasoning?
We examine how well the state-of-the-art (SOTA) models used in legal reasoning support abductive reasoning tasks. Abductive reasoning is a form of logical inference in which a hypothesis is formulated from a set of observations, and that hypothesis is used to explain the observations. The ability to formulate such hypotheses is important for lawyers and legal scholars as it helps them articulate logical arguments, interpret laws, and develop legal theories. Our motivation is to consider the belief that deep learning models, especially large language models (LLMs), will soon replace lawyers because they perform well on tasks related to legal text processing. But to do so, we believe, requires some form of abductive hypothesis formation. In other words, while LLMs become more popular and powerful, we want to investigate their capacity for abductive reasoning. To pursue this goal, we start by building a logic-augmented dataset for abductive reasoning with 498,697 samples and then use it to evaluate the performance of a SOTA model in the legal field. Our experimental results show that although these models can perform well on tasks related to some aspects of legal text processing, they still fall short in supporting abductive reasoning tasks.
Lamarr: LHCb ultra-fast simulation based on machine learning models deployed within Gauss
About 90% of the computing resources available to the LHCb experiment have been spent to produce simulated data samples for Run 2 of the Large Hadron Collider at CERN. The upgraded LHCb detector will be able to collect larger data samples, requiring many more simulated events to analyze the data to be collected in Run 3. Simulation is a key necessity of analysis to interpret signal, reject background and measure efficiencies. The needed simulation will far exceed the pledged resources, requiring an evolution in technologies and techniques to produce these simulated data samples. In this contribution, we discuss Lamarr, a Gaudi-based framework to speed up the simulation production by parameterizing both the detector response and the reconstruction algorithms of the LHCb experiment. Deep Generative Models powered by several algorithms and strategies are employed to effectively parameterize the high-level response of the single components of the LHCb detector, encoding within neural networks the experimental errors and uncertainties introduced in the detection and reconstruction phases. Where possible, models are trained directly on real data, statistically subtracting any background components by applying appropriate reweighing procedures. Embedding Lamarr in the general LHCb Gauss Simulation framework allows its execution to be combined with any of the available generators in a seamless way. The resulting software package enables a simulation process independent of the detailed simulation used to date.
Streaming Transformer ASR with Blockwise Synchronous Beam Search
The Transformer self-attention network has shown promising performance as an alternative to recurrent neural networks in end-to-end (E2E) automatic speech recognition (ASR) systems. However, Transformer has a drawback in that the entire input sequence is required to compute both self-attention and source--target attention. In this paper, we propose a novel blockwise synchronous beam search algorithm based on blockwise processing of encoder to perform streaming E2E Transformer ASR. In the beam search, encoded feature blocks are synchronously aligned using a block boundary detection technique, where a reliability score of each predicted hypothesis is evaluated based on the end-of-sequence and repeated tokens in the hypothesis. Evaluations of the HKUST and AISHELL-1 Mandarin, LibriSpeech English, and CSJ Japanese tasks show that the proposed streaming Transformer algorithm outperforms conventional online approaches, including monotonic chunkwise attention (MoChA), especially when using the knowledge distillation technique. An ablation study indicates that our streaming approach contributes to reducing the response time, and the repetition criterion contributes significantly in certain tasks. Our streaming ASR models achieve comparable or superior performance to batch models and other streaming-based Transformer methods in all tasks considered.
Trend-Based SAC Beam Control Method with Zero-Shot in Superconducting Linear Accelerator
The superconducting linear accelerator is a highly flexible facility for modern scientific discoveries, necessitating weekly reconfiguration and tuning. Accordingly, minimizing setup time proves essential in affording users ample experimental time. We propose a trend-based soft actor-critic (TBSAC) beam control method with strong robustness, allowing the agents to be trained in a simulated environment and applied to the real accelerator directly with zero-shot. To validate the effectiveness of our method, two different typical beam control tasks were performed on the China Accelerator Facility for Superheavy Elements (CAFe II) and a light particle injector (LPI) respectively. The orbit correction tasks were performed in three cryomodules in CAFe II separately; the time required for tuning was reduced to one-tenth of that needed by human experts, and the RMS values of the corrected orbit were all less than 1 mm. The other task, transmission efficiency optimization, was conducted in the LPI; our agent successfully optimized the transmission efficiency of the radio-frequency quadrupole (RFQ) to over 85% within 2 minutes. The outcomes of these two experiments offer substantiation that our proposed TBSAC approach can efficiently and effectively accomplish beam commissioning tasks while upholding the same standard as skilled human experts. As such, our method exhibits potential for future applications in other accelerator commissioning fields.
Localization-Based Beam Focusing in Near-Field Communications
Shifting 6G-and-beyond wireless communication systems to higher frequency bands and the utilization of massive multiple-input multiple-output arrays will extend the near-field region, affecting beamforming and user localization schemes. In this paper, we propose a localization-based beam-focusing strategy that leverages the dominant line-of-sight (LoS) propagation arising at mmWave and sub-THz frequencies. To support this approach, we analyze the 2D-MUSIC algorithm for distance estimation by examining its spectrum in simplified, tractable setups with minimal numbers of antennas and users. Lastly, we compare the proposed localization-based beam focusing, with locations estimated via 2D-MUSIC, with zero forcing with pilot-based channel estimation in terms of uplink sum spectral efficiency. Our numerical results show that the proposed method becomes more effective under LoS-dominated propagation, short coherence blocks, and strong noise power arising at high carrier frequencies and with large bandwidths.
Neural Modulation Fields for Conditional Cone Beam Neural Tomography
Conventional Computed Tomography (CT) methods require large numbers of noise-free projections for accurate density reconstructions, limiting their applicability to the more complex class of Cone Beam Geometry CT (CBCT) reconstruction. Recently, deep learning methods have been proposed to overcome these limitations, with methods based on neural fields (NF) showing strong performance, by approximating the reconstructed density through a continuous-in-space coordinate based neural network. Our focus is on improving such methods. However, unlike previous work, which requires training an NF from scratch for each new set of projections, we instead propose to leverage anatomical consistencies over different scans by training a single conditional NF on a dataset of projections. We propose a novel conditioning method where local modulations are modeled per patient as a field over the input domain through a Neural Modulation Field (NMF). The resulting Conditional Cone Beam Neural Tomography (CondCBNT) shows improved performance for both high and low numbers of available projections on noise-free and noisy data.
What Do You Get When You Cross Beam Search with Nucleus Sampling?
We combine beam search with the probabilistic pruning technique of nucleus sampling to create two deterministic nucleus search algorithms for natural language generation. The first algorithm, p-exact search, locally prunes the next-token distribution and performs an exact search over the remaining space. The second algorithm, dynamic beam search, shrinks and expands the beam size according to the entropy of the candidate's probability distribution. Despite the probabilistic intuition behind nucleus search, experiments on machine translation and summarization benchmarks show that both algorithms reach the same performance levels as standard beam search.
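The local pruning step mentioned above can be illustrated with a minimal sketch of nucleus (top-p) truncation of a next-token distribution. This is not the paper's p-exact search, only the pruning idea it builds on; the function name `nucleus_prune` is hypothetical, and renormalizing the surviving probabilities is an assumption made here for illustration.

```python
from typing import Dict


def nucleus_prune(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest set of most-probable tokens whose mass reaches p.

    The kept probabilities are renormalized to sum to 1 (an assumption of
    this sketch, not something stated in the abstract).
    """
    kept: Dict[str, float] = {}
    mass = 0.0
    # Visit tokens from most to least probable until the nucleus is full.
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        mass += prob
        if mass >= p:
            break
    return {tok: prob / mass for tok, prob in kept.items()}


pruned = nucleus_prune({"a": 0.5, "b": 0.3, "c": 0.2}, p=0.7)
```

With `p=0.7`, the tokens `a` and `b` together cover 0.8 of the mass, so `c` is dropped. A deterministic search can then be run over the pruned distribution instead of sampling from it.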
ImagePairs: Realistic Super Resolution Dataset via Beam Splitter Camera Rig
Super Resolution is the problem of recovering a high-resolution image from a single or multiple low-resolution images of the same scene. It is an ill-posed problem since high frequency visual details of the scene are completely lost in low-resolution images. To overcome this, many machine learning approaches have been proposed aiming at training a model to recover the lost details in the new scenes. Such approaches include the recent successful effort in utilizing deep learning techniques to solve the super resolution problem. As proven, data itself plays a significant role in the machine learning process, especially in deep learning approaches, which are data hungry. Therefore, to solve the problem, the process of gathering data and its formation could be as vital as the machine learning technique used. Herein, we propose a new data acquisition technique for gathering a real image data set which could be used as an input for super resolution, noise cancellation and quality enhancement techniques. We use a beam-splitter to capture the same scene by a low resolution camera and a high resolution camera. Since we also release the raw images, this large-scale dataset could be used for other tasks such as ISP generation. Unlike the current small-scale datasets used for these tasks, our proposed dataset includes 11,421 pairs of low-resolution and high-resolution images of diverse scenes. To our knowledge this is the most complete dataset for super resolution, ISP and image quality enhancement. The benchmarking result shows how the new dataset can be successfully used to significantly improve the quality of real-world image super resolution.
Is your stochastic signal really detectable?
Separating a stochastic gravitational wave background (SGWB) from noise is a challenging statistical task. One approach to establishing a detection criterion for the SGWB is using Bayesian evidence. If the evidence ratio (Bayes factor) between models with and without the signal exceeds a certain threshold, the signal is considered detected. We present a formalism to compute the averaged Bayes factor, incorporating instrumental-noise and SGWB uncertainties. As an example, we consider the case of power-law-shaped SGWB in LISA and generate the corresponding Bayesian sensitivity curve. Unlike existing methods in the literature, which typically neglect uncertainties in both the signal and noise, our approach provides a reliable and realistic alternative. This flexible framework opens avenues for more robust stochastic gravitational wave background detection across gravitational-wave experiments.
Weak localization in radiative transfer of acoustic waves in a randomly-fluctuating slab
This paper concerns the derivation of radiative transfer equations for acoustic waves propagating in a randomly fluctuating slab (between two parallel planes) in the weak-scattering regime, and the study of boundary effects through an asymptotic analysis of the Wigner transform of the wave solution. These radiative transfer equations make it possible to model the transport of wave energy density, taking into account the scattering by random heterogeneities. The approach builds on the method of images, where the slab is extended to a full-space, with a periodic map of mechanical properties and a series of sources located along a periodic pattern. Two types of boundary effects, both on the (small) scale of the wavelength, are observed: one at the boundaries of the slab, and one inside the domain. The former impacts the entire energy density (coherent as well as incoherent) and is also observed in half-spaces. The latter, more specific to slabs, corresponds to the constructive interference of waves that have reflected at least twice on the boundaries of the slab and only impacts the coherent part of the energy density.
Bayesian Updates Compose Optically
Bayes' rule tells us how to invert a causal process in order to update our beliefs in light of new evidence. If the process is believed to have a complex compositional structure, we may ask whether composing the inversions of the component processes gives the same belief update as the inversion of the whole. We answer this question affirmatively, showing that the relevant compositional structure is precisely that of the lens pattern, and that we can think of Bayesian inversion as a particular instance of a state-dependent morphism in a corresponding fibred category. We define a general notion of (mixed) Bayesian lens, and discuss the (un)lawfulness of these lenses when their contravariant components are exact Bayesian inversions. We prove our main result both abstractly and concretely, for both discrete and continuous states, taking care to illustrate the common structures.
Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation
The rapid growth of biomedical knowledge has outpaced our ability to efficiently extract insights and generate novel hypotheses. Large language models (LLMs) have emerged as a promising tool to revolutionize knowledge interaction and potentially accelerate biomedical discovery. In this paper, we present a comprehensive evaluation of LLMs as biomedical hypothesis generators. We construct a dataset of background-hypothesis pairs from biomedical literature, carefully partitioned into training, seen, and unseen test sets based on publication date to mitigate data contamination. Using this dataset, we assess the hypothesis generation capabilities of top-tier instructed models in zero-shot, few-shot, and fine-tuning settings. To enhance the exploration of uncertainty, a crucial aspect of scientific discovery, we incorporate tool use and multi-agent interactions in our evaluation framework. Furthermore, we propose four novel metrics grounded in extensive literature review to evaluate the quality of generated hypotheses, considering both LLM-based and human assessments. Our experiments yield two key findings: 1) LLMs can generate novel and validated hypotheses, even when tested on literature unseen during training, and 2) Increasing uncertainty through multi-agent interactions and tool use can facilitate diverse candidate generation and improve zero-shot hypothesis generation performance. However, we also observe that the integration of additional knowledge through few-shot learning and tool use may not always lead to performance gains, highlighting the need for careful consideration of the type and scope of external knowledge incorporated. These findings underscore the potential of LLMs as powerful aids in biomedical hypothesis generation and provide valuable insights to guide further research in this area.
Automatic Prompt Optimization with "Gradient Descent" and Beam Search
Large Language Models (LLMs) have shown impressive performance as general purpose agents, but their abilities remain highly dependent on prompts which are hand written with onerous trial-and-error effort. We propose a simple and nonparametric solution to this problem, Automatic Prompt Optimization (APO), which is inspired by numerical gradient descent to automatically improve prompts, assuming access to training data and an LLM API. The algorithm uses minibatches of data to form natural language "gradients" that criticize the current prompt. The gradients are then "propagated" into the prompt by editing the prompt in the opposite semantic direction of the gradient. These gradient descent steps are guided by a beam search and bandit selection procedure which significantly improves algorithmic efficiency. Preliminary results across three benchmark NLP tasks and the novel problem of LLM jailbreak detection suggest that Automatic Prompt Optimization can outperform prior prompt editing techniques and improve an initial prompt's performance by up to 31%, by using data to rewrite vague task descriptions into more precise annotation instructions.
Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation
The end-to-end nature of neural machine translation (NMT) removes many ways of manually guiding the translation process that were available in older paradigms. Recent work, however, has introduced a new capability: lexically constrained or guided decoding, a modification to beam search that forces the inclusion of pre-specified words and phrases in the output. However, while theoretically sound, existing approaches have computational complexities that are either linear (Hokamp and Liu, 2017) or exponential (Anderson et al., 2017) in the number of constraints. We present an algorithm for lexically constrained decoding with a complexity of O(1) in the number of constraints. We demonstrate the algorithm's remarkable ability to properly place these constraints, and use it to explore the shaky relationship between model and BLEU scores. Our implementation is available as part of Sockeye.
Allies: Prompting Large Language Model with Beam Search
With the advance of large language models (LLMs), the research field of LLM applications has become more and more popular, and the idea of constructing pipelines to accomplish complex tasks by stacking LLM API calls has come true. However, methods of this kind face two limitations: narrow information coverage and low fault tolerance. In this work, we propose a novel method called ALLIES. Given an input query, ALLIES leverages LLMs to iteratively generate new queries related to the original query, enabling an iterative reasoning process. By iteratively refining and expanding the scope of the original query, ALLIES captures and utilizes hidden knowledge that may not be directly obtainable through retrieval. We take zero-shot open-domain question answering (ODQA) as an application scenario and evaluate ALLIES on widely used benchmarks such as NQ, WebQ and TriviaQA. The experimental results demonstrate that ALLIES significantly outperforms other zero-shot baselines, indicating its effectiveness in tackling those challenges. Our code is available at https://github.com/microsoft/SimXNS/tree/main/ALLIES.
Roughness Tolerance Studies for the Undulator Beam Pipe Chamber of LCLS-II
We investigate the effect of wall roughness on the wakefield-induced energy variation in the undulator beam pipe of LCLS-II. We find that a wall roughness equivalent to an rms surface slope of 30 mr increases the total induced energy variation within the bunch (due to the resistive wall wake) by a modest 20%.
Accurate and robust methods for direct background estimation in resonant anomaly detection
Resonant anomaly detection methods have great potential for enhancing the sensitivity of traditional bump hunt searches. A key component of these methods is a high quality background template used to produce an anomaly score. Using the LHC Olympics R&D dataset, we demonstrate that this background template can also be repurposed to directly estimate the background expectation in a simple cut and count setup. In contrast to a traditional bump hunt, no fit to the invariant mass distribution is needed, thereby avoiding the potential problem of background sculpting. Furthermore, direct background estimation allows working with large background rejection rates, where resonant anomaly detection methods typically show their greatest improvement in significance.
Novel $|V_{cb}|$ extraction method via boosted bc-tagging with in-situ calibration
We present a novel method for measuring $|V_{cb}|$ at the LHC using an advanced boosted-jet tagger to identify "bc signatures". By associating boosted $W \rightarrow bc$ signals with bc-matched jets from top-quark decays, we enable an in-situ calibration of the tagger. This approach significantly suppresses backgrounds while reducing uncertainties in flavor tagging efficiencies -- key to improving measurement precision. Our study is enabled by the development of realistic, AI-based large- and small-radius taggers, Sophon and the newly introduced SophonAK4, validated to match ATLAS and CMS's state-of-the-art taggers. The method complements the conventional small radius jet approach and enables a ~30% improvement in $|V_{cb}|$ precision under HL-LHC projections. As a byproduct, it enhances $H^{\pm} \rightarrow bc$ search sensitivity by a factor of 2--5 over the recent ATLAS result based on Run 2 data. Our work offers a new perspective for the precision $|V_{cb}|$ measurement and highlights the potential of using advanced tagging models to probe unexplored boosted regimes at the LHC.
Interpretation of excess in $H \to Z\gamma$ using a light axion-like particle
We interpret the recent excess in a rare decay of the Higgs boson, $H \to Z\gamma$, using a light axion-like particle (ALP) in the mass range 0.05 - 0.1 GeV. The dominant decay of such a light ALP is into a pair of collimated photons, whose decay is required to happen before reaching the ECAL detector, such that it mimics a single photon in the detector. It can explain the excess with a coupling $C^{\rm eff}_{aZH} / \Lambda \sim 4 \times 10^{-5}\;{\rm GeV}^{-1}$, while the decay of the ALP before reaching the ECAL requires the diphoton coupling $C^{\rm eff}_{\gamma\gamma}/ \Lambda \ge 0.35\,{\rm TeV}^{-1} (0.1\,{\rm eV}/m_a)^2$. A potential test would be the rare decay of the Z boson $Z \to a H^* \to a (b \bar b)$ at the Tera-Z option of the future FCC and CEPC. However, it has a branching ratio of only $O(10^{-12})$, and thus barely testable. The production cross section for $pp \to Z^* \to a H$ via the same coupling $C^{\rm eff}_{aZH} / \Lambda$ at the LHC is too small for detection.
mini-TimeCube as a Neutron Scatter Camera
We present Monte Carlo (MC) simulation results from a study of a compact plastic-scintillator detector suitable for imaging fast neutrons in the 1 -- 10 MeV energy range: the miniTimeCube (mTC). Originally designed for antineutrino detection, the mTC consists of 24 MultiChannel Plate (MCP) photodetectors surrounding a 13 cm cube of boron-doped plastic scintillator. Our simulation results show that waveform digitization of 1536 optically sensitive channels surrounding the scintillator should allow for spatiotemporal determination of individual neutron-proton scatters in the detector volume to $\sim$100 picoseconds and $\sim$5 mm. A Bayesian estimation framework is presented for multiple-scatter reconstruction, and is used to estimate the incoming direction and energy of simulated individual neutrons. Finally, we show how populations of reconstructed neutrons can be used to estimate the direction and energy spectrum of nearby simulated neutron sources.
Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach
Large language models (LLMs) have shown great promise for capturing contextual information in natural language processing tasks. We propose a novel approach to speaker diarization that incorporates the prowess of LLMs to exploit contextual cues in human dialogues. Our method builds upon an acoustic-based speaker diarization system by adding lexical information from an LLM in the inference stage. We model the multi-modal decoding process probabilistically and perform joint acoustic and lexical beam search to incorporate cues from both modalities: audio and text. Our experiments demonstrate that infusing lexical knowledge from the LLM into an acoustics-only diarization system improves overall speaker-attributed word error rate (SA-WER). The experimental results show that LLMs can provide complementary information to acoustic models for the speaker diarization task via the proposed beam search decoding approach, showing up to 39.8% relative delta-SA-WER improvement over the baseline system. Thus, we substantiate that the proposed technique is able to exploit contextual information that is inaccessible to acoustics-only systems, which are represented by speaker embeddings. In addition, these findings point to the potential of using LLMs to improve speaker diarization and other speech processing tasks by capturing semantic and contextual cues.
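As a rough illustration of the joint acoustic and lexical beam search mentioned in this abstract, the sketch below scores each speaker hypothesis by the sum of an acoustic log-probability and a weighted lexical log-probability, keeping only the best few speaker sequences at each word. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the `lexical_weight` parameter, the toy probabilities, and the `toy_lexical` rule (a stand-in for an LLM) are all hypothetical.

```python
import math

def joint_beam_search(acoustic_probs, lexical_prob, beam_width=2, lexical_weight=1.0):
    """Pick a speaker label per word using acoustic and lexical scores.

    acoustic_probs: list (one entry per word) of dicts speaker -> P(speaker | audio).
    lexical_prob:   callable (history, speaker, index) -> P(speaker | text so far);
                    a hypothetical stand-in for an LLM-based score.
    All probabilities are assumed to be strictly positive.
    """
    beams = [((), 0.0)]  # (speaker history, cumulative log-score)
    for index, word_probs in enumerate(acoustic_probs):
        expanded = []
        for history, score in beams:
            for speaker, p_acoustic in word_probs.items():
                p_lexical = lexical_prob(history, speaker, index)
                new_score = (score + math.log(p_acoustic)
                             + lexical_weight * math.log(p_lexical))
                expanded.append((history + (speaker,), new_score))
        # Keep only the highest-scoring hypotheses.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]
    return beams[0][0]

def toy_lexical(history, speaker, index):
    """Toy rule: a speaker change is likely only at word 3 (start of a reply)."""
    if not history:
        return 0.5
    same_speaker = speaker == history[-1]
    if index == 3:
        return 0.2 if same_speaker else 0.8
    return 0.8 if same_speaker else 0.2

# Toy per-word acoustic posteriors; word 2 is acoustically ambiguous.
acoustic_probs = [
    {"A": 0.9, "B": 0.1},
    {"A": 0.8, "B": 0.2},
    {"A": 0.45, "B": 0.55},
    {"A": 0.3, "B": 0.7},
]

acoustic_only = tuple(max(p, key=p.get) for p in acoustic_probs)
joint = joint_beam_search(acoustic_probs, toy_lexical)
```

In this toy setup the acoustics-only decision flips the speaker one word too early, while the joint search moves the speaker change to the word where the lexical cue places it.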
Liver Segmentation using Turbolift Learning for CT and Cone-beam C-arm Perfusion Imaging
Model-based reconstruction employing the time separation technique (TST) was found to improve dynamic perfusion imaging of the liver using C-arm cone-beam computed tomography (CBCT). To apply TST using prior knowledge extracted from CT perfusion data, the liver should be accurately segmented from the CT scans. Reconstructions of primary and model-based CBCT data need to be segmented for proper visualisation and interpretation of perfusion maps. This research proposes Turbolift learning, which trains a modified version of the multi-scale Attention UNet on different liver segmentation tasks serially, in the order CT, CBCT, CBCT TST, so that each training stage acts as pre-training for the subsequent ones, addressing the problem of the limited number of datasets available for training. For the final task of liver segmentation from CBCT TST, the proposed method achieved overall Dice scores of 0.874 ± 0.031 and 0.905 ± 0.007 in 6-fold and 4-fold cross-validation experiments, respectively, securing statistically significant improvements over the model trained only for that task. Experiments revealed that Turbolift not only improves the overall performance of the model but also makes it robust against artefacts originating from the embolisation materials and truncation artefacts. Additionally, in-depth analyses confirmed the order of the segmentation tasks. This paper shows the potential of segmenting the liver from CT, CBCT, and CBCT TST, learning from the available limited training data, which can possibly be used in the future for the visualisation and evaluation of the perfusion maps for the treatment evaluation of liver diseases.
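For readers unfamiliar with the metric quoted in this abstract, the Dice score measures the overlap between a predicted mask A and a reference mask B as 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch in plain Python; the function name, the toy masks, and the convention for two empty masks are assumptions for illustration and are not taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Count positions where both masks are 1.
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Toy flattened masks: 3 foreground voxels each, 2 of them shared.
pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
score = dice_score(pred, target)  # 2 * 2 / (3 + 3)
```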
DeepA2: A Modular Framework for Deep Argument Analysis with Pretrained Neural Text2Text Language Models
In this paper, we present and implement a multi-dimensional, modular framework for performing deep argument analysis (DeepA2) using current pre-trained language models (PTLMs). ArgumentAnalyst -- a T5 model (Raffel et al. 2020) set up and trained within DeepA2 -- reconstructs argumentative texts, which advance an informal argumentation, as valid arguments: It inserts, e.g., missing premises and conclusions, formalizes inferences, and coherently links the logical reconstruction to the source text. We create a synthetic corpus for deep argument analysis, and evaluate ArgumentAnalyst on this new dataset as well as on existing data, specifically EntailmentBank (Dalvi et al. 2021). Our empirical findings vindicate the overall framework and highlight the advantages of a modular design, in particular its ability to emulate established heuristics (such as hermeneutic cycles), to explore the model's uncertainty, to cope with the plurality of correct solutions (underdetermination), and to exploit higher-order evidence.
PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models
Physics problem-solving is a challenging domain for large AI models, requiring integration of conceptual understanding, mathematical reasoning, and interpretation of physical diagrams. Current evaluation methodologies show notable limitations in capturing the breadth and complexity of undergraduate-level physics, underscoring the need for more rigorous assessments. To this end, we present PhysUniBench, a large-scale multimodal benchmark designed to evaluate and improve the reasoning capabilities of multimodal large language models (MLLMs) specifically on undergraduate-level physics problems. PhysUniBench consists of 3,304 physics questions spanning 8 major sub-disciplines of physics, each accompanied by one visual diagram. The benchmark includes both open-ended and multiple-choice questions, systematically curated and difficulty-rated through an iterative model-in-the-loop process. The benchmark's construction involved a rigorous multi-stage process, including multiple roll-outs, expert-level evaluation, automated filtering of easily solved problems, and a nuanced difficulty grading system with five levels. Through extensive experiments, we observe that current state-of-the-art models encounter substantial challenges in physics reasoning. For example, GPT-4o mini achieves only about 34.2% accuracy on the proposed PhysUniBench. These results highlight that current MLLMs struggle with advanced physics reasoning, especially on multi-step problems and those requiring precise diagram interpretation. By providing a broad and rigorous assessment tool, PhysUniBench aims to drive progress in AI for Science, encouraging the development of models with stronger physical reasoning, problem-solving skills, and multimodal understanding. The benchmark and evaluation scripts are available at https://prismax-team.github.io/PhysUniBenchmark/.
A Method to Simultaneously Facilitate All Jet Physics Tasks
Machine learning has become an essential tool in jet physics. Due to their complex, high-dimensional nature, jets can be explored holistically by neural networks in ways that are not possible manually. However, innovations in all areas of jet physics are proceeding in parallel. We show that specially constructed machine learning models trained for a specific jet classification task can improve the accuracy, precision, or speed of all other jet physics tasks. This is demonstrated by training on a particular multiclass generation and classification task and then using the learned representation for different generation and classification tasks, for datasets with a different (full) detector simulation, for jets from a different collision system (pp versus ep), for generative models, for likelihood ratio estimation, and for anomaly detection. We thus consider our OmniLearn approach to be a jet-physics foundation model. It is made publicly available for use in any area where state-of-the-art precision is required for analyses involving jets and their substructure.
PhySense: Principle-Based Physics Reasoning Benchmarking for Large Language Models
Large language models (LLMs) have rapidly advanced and are increasingly capable of tackling complex scientific problems, including those in physics. Despite this progress, current LLMs often fail to emulate the concise, principle-based reasoning characteristic of human experts, instead generating lengthy and opaque solutions. This discrepancy highlights a crucial gap in their ability to apply core physical principles for efficient and interpretable problem solving. To systematically investigate this limitation, we introduce PhySense, a novel principle-based physics reasoning benchmark designed to be easily solvable by experts using guiding principles, yet deceptively difficult for LLMs without principle-first reasoning. Our evaluation across multiple state-of-the-art LLMs and prompt types reveals a consistent failure to align with expert-like reasoning paths, providing insights for developing AI systems with efficient, robust and interpretable principle-based scientific reasoning.
On the Relationship Between Explanation and Prediction: A Causal View
Being able to provide explanations for a model's decision has become a central requirement for the development, deployment, and adoption of machine learning models. However, we are yet to understand what explanation methods can and cannot do. How do upstream factors such as data, model prediction, hyperparameters, and random initialization influence downstream explanations? While previous work raised concerns that explanations (E) may have little relationship with the prediction (Y), there is a lack of conclusive study to quantify this relationship. Our work borrows tools from causal inference to systematically assay this relationship. More specifically, we study the relationship between E and Y by measuring the treatment effect when intervening on their causal ancestors, i.e., on hyperparameters and inputs used to generate saliency-based Es or Ys. Our results suggest that the relationship between E and Y is far from ideal. In fact, the gap from the 'ideal' case only increases in higher-performing models -- models that are likely to be deployed. Our work is a promising first step towards providing a quantitative measure of the relationship between E and Y, which could also inform the future development of methods for E with a quantitative metric.
The LHCb ultra-fast simulation option, Lamarr: design and validation
Detailed detector simulation is the major consumer of CPU resources at LHCb, having used more than 90% of the total computing budget during Run 2 of the Large Hadron Collider at CERN. As data is collected by the upgraded LHCb detector during Run 3 of the LHC, larger requests for simulated data samples are necessary, and will far exceed the pledged resources of the experiment, even with existing fast simulation options. An evolution of technologies and techniques to produce simulated samples is mandatory to meet the upcoming needs of analysis to interpret signal versus background and measure efficiencies. In this context, we propose Lamarr, a Gaudi-based framework designed to offer the fastest solution for the simulation of the LHCb detector. Lamarr consists of a pipeline of modules parameterizing both the detector response and the reconstruction algorithms of the LHCb experiment. Most of the parameterizations are made of Deep Generative Models and Gradient Boosted Decision Trees trained on simulated samples or alternatively, where possible, on real data. Embedding Lamarr in the general LHCb Gauss Simulation framework allows combining its execution with any of the available generators in a seamless way. Lamarr has been validated by comparing key reconstructed quantities with Detailed Simulation. Good agreement of the simulated distributions is obtained with two-order-of-magnitude speed-up of the simulation phase.
Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation
Recent studies have revealed a number of pathologies of neural machine translation (NMT) systems. Hypotheses explaining these mostly suggest there is something fundamentally wrong with NMT as a model or its training algorithm, maximum likelihood estimation (MLE). Most of this evidence was gathered using maximum a posteriori (MAP) decoding, a decision rule aimed at identifying the highest-scoring translation, i.e. the mode. We argue that the evidence corroborates the inadequacy of MAP decoding more than casts doubt on the model and its training algorithm. In this work, we show that translation distributions do reproduce various statistics of the data well, but that beam search strays from such statistics. We show that some of the known pathologies and biases of NMT are due to MAP decoding and not to NMT's statistical assumptions nor MLE. In particular, we show that the most likely translations under the model accumulate so little probability mass that the mode can be considered essentially arbitrary. We therefore advocate for the use of decision rules that take into account the translation distribution holistically. We show that an approximation to minimum Bayes risk decoding gives competitive results confirming that NMT models do capture important aspects of translation well in expectation.
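To make the contrast between picking the mode and using the whole translation distribution concrete, the sketch below shows a sampling-based approximation to minimum Bayes risk (MBR) decoding: among sampled translations, choose the one with the highest average utility against all the samples, which act as pseudo-references. This is a minimal sketch under assumptions; the unigram-F1 utility, the function names, and the toy samples are illustrative stand-ins and are not taken from the paper.

```python
from collections import Counter

def unigram_f1(hyp, ref):
    """Unigram F1 overlap between two whitespace-tokenized sentences."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def mbr_decode(samples, utility=unigram_f1):
    """Return the sample with the highest average utility against all samples."""
    def expected_utility(candidate):
        return sum(utility(candidate, other) for other in samples) / len(samples)
    return max(samples, key=expected_utility)

# Toy samples standing in for draws from a translation model.
samples = [
    "the cat sat on the mat",
    "a cat sat on the mat",
    "the cat is on the mat",
    "dogs bark loudly",
]
best = mbr_decode(samples)
```

The selected sentence is the one that agrees most with the other samples on average, so an outlier is never chosen even if a model happened to assign it a high individual probability.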
HoloBeam: Learning Optimal Beamforming in Far-Field Holographic Metasurface Transceivers
Holographic Metasurface Transceivers (HMTs) are emerging as cost-effective substitutes to large antenna arrays for beamforming in Millimeter and TeraHertz wave communication. However, to achieve desired channel gains through beamforming in HMT, phase-shifts of a large number of elements need to be appropriately set, which is challenging. Also, these optimal phase-shifts depend on the location of the receivers, which could be unknown. In this work, we develop a learning algorithm using a fixed-budget multi-armed bandit framework to beamform and maximize received signal strength at the receiver for far-field regions. Our algorithm, named HoloBeam, exploits the parametric form of channel gains of the beams, which can be expressed in terms of two phase-shifting parameters. Even after parameterization, the problem is still challenging as phase-shifting parameters take continuous values. To overcome this, HoloBeam works with the discrete values of phase-shifting parameters and exploits their unimodal relations with channel gains to learn the optimal values faster. We upper bound the probability of HoloBeam incorrectly identifying the (discrete) optimal phase-shift parameters in terms of the number of pilots used in learning. We show that this probability decays exponentially with the number of pilot signals. We demonstrate that HoloBeam outperforms state-of-the-art algorithms through extensive simulations.
Large Language Models can Learn Rules
When prompted with a few examples and intermediate steps, large language models (LLMs) have demonstrated impressive performance in various reasoning tasks. However, prompting methods that rely on implicit knowledge in an LLM often generate incorrect answers when the implicit knowledge is wrong or inconsistent with the task. To tackle this problem, we present Hypotheses-to-Theories (HtT), a framework that learns a rule library for reasoning with LLMs. HtT contains two stages, an induction stage and a deduction stage. In the induction stage, an LLM is first asked to generate and verify rules over a set of training examples. Rules that appear and lead to correct answers sufficiently often are collected to form a rule library. In the deduction stage, the LLM is then prompted to employ the learned rule library to perform reasoning to answer test questions. Experiments on relational reasoning, numerical reasoning and concept learning problems show that HtT improves existing prompting methods, with an absolute gain of 10-30% in accuracy. The learned rules are also transferable to different models and to different forms of the same problem.
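The induction stage described in this abstract, where rules that appear and lead to correct answers sufficiently often are collected into a library, can be sketched as a simple counting filter. This is a minimal sketch under assumptions: the thresholds `min_count` and `min_accuracy`, the function name, and the toy kinship observations are hypothetical, and in the actual framework the rules would come from LLM generations rather than a hand-written list.

```python
from collections import defaultdict

def build_rule_library(observations, min_count=2, min_accuracy=0.7):
    """Keep rules that occur often enough and usually lead to correct answers.

    observations: iterable of (rule, led_to_correct_answer) pairs, one per
    use of a rule on a training example.
    """
    stats = defaultdict(lambda: [0, 0])  # rule -> [occurrences, correct uses]
    for rule, led_to_correct in observations:
        stats[rule][0] += 1
        stats[rule][1] += int(led_to_correct)
    return {
        rule
        for rule, (count, correct) in stats.items()
        if count >= min_count and correct / count >= min_accuracy
    }

# Toy observations standing in for rules extracted from model outputs.
observations = [
    ("father of father = grandfather", True),
    ("father of father = grandfather", True),
    ("father of father = grandfather", True),
    ("sister of father = aunt", True),
    ("sister of father = aunt", True),
    ("brother of mother = father", False),
    ("brother of mother = father", False),
    ("mother of mother = grandmother", True),  # correct but seen only once
]
library = build_rule_library(observations)
```

In the deduction stage, the retained rules would then be placed in the prompt so the model can look them up instead of relying on implicit knowledge.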
Learning to Generate Novel Scientific Directions with Contextualized Literature-based Discovery
Literature-Based Discovery (LBD) aims to discover new scientific knowledge by mining papers and generating hypotheses. Standard LBD is limited to predicting pairwise relations between discrete concepts (e.g., drug-disease links), and ignores critical contexts like experimental settings (e.g., a specific patient population where a drug is evaluated) and background motivations (e.g., to find drugs without specific side effects). We address these limitations with a novel formulation of contextualized-LBD (C-LBD): generating scientific hypotheses in natural language, while grounding them in a context that controls the hypothesis search space. We present a modeling framework using retrieval of "inspirations" from past scientific papers. Our evaluations reveal that GPT-4 tends to generate ideas with overall low technical depth and novelty, while our inspiration prompting approaches partially mitigate this issue. Our work represents a first step toward building language models that generate new ideas derived from scientific literature.
Planck 2018 results. V. CMB power spectra and likelihoods
This paper describes the 2018 Planck CMB likelihoods, following a hybrid approach similar to the 2015 one, with different approximations at low and high multipoles, and implementing several methodological and analysis refinements. With more realistic simulations, and better correction and modelling of systematics, we can now make full use of the High Frequency Instrument polarization data. The low-multipole 100x143 GHz EE cross-spectrum constrains the reionization optical-depth parameter $\tau$ to better than 15% (in combination with the other low- and high-$\ell$ likelihoods). We also update the 2015 baseline low-$\ell$ joint TEB likelihood based on the Low Frequency Instrument data, which provides a weaker $\tau$ constraint. At high multipoles, a better model of the temperature-to-polarization leakage and corrections for the effective calibrations of the polarization channels (polarization efficiency or PE) allow us to fully use the polarization spectra, improving the constraints on the $\Lambda$CDM parameters by 20 to 30% compared to TT-only constraints. Tests on the modelling of the polarization demonstrate good consistency, with some residual modelling uncertainties, the accuracy of the PE modelling being the main limitation. Using our various tests, simulations, and comparison between different high-$\ell$ implementations, we estimate the consistency of the results to be better than the $0.5\sigma$ level. Minor curiosities already present before (differences between $\ell<800$ and $\ell>800$ parameters or the preference for more smoothing of the $C_\ell$ peaks) are shown to be driven by the TT power spectrum and are not significantly modified by the inclusion of polarization. Overall, the legacy Planck CMB likelihoods provide a robust tool for constraining the cosmological model and represent a reference for future CMB observations. (Abridged)
Don't Say What You Don't Know: Improving the Consistency of Abstractive Summarization by Constraining Beam Search
Abstractive summarization systems today produce fluent and relevant output, but often "hallucinate" statements not supported by the source text. We analyze the connection between hallucinations and training data, and find evidence that models hallucinate because they train on target summaries that are unsupported by the source. Based on our findings, we present PINOCCHIO, a new decoding method that improves the consistency of a transformer-based abstractive summarizer by constraining beam search to avoid hallucinations. Given the model states and outputs at a given step, PINOCCHIO detects likely model hallucinations based on various measures of attribution to the source text. PINOCCHIO backtracks to find more consistent output, and can opt to produce no summary at all when no consistent generation can be found. In experiments, we find that PINOCCHIO improves the consistency of generation (in terms of F1) by an average of 67% on two abstractive summarization datasets.
The Machine Learning Landscape of Top Taggers
Based on the established task of identifying boosted, hadronically decaying top quarks, we compare a wide range of modern machine learning approaches. Unlike most established methods, they rely on low-level input, for instance calorimeter output. While their network architectures are vastly different, their performance is comparatively similar. In general, we find that these new approaches are extremely powerful and great fun.
Experimental Estimation of Quantum State Properties from Classical Shadows
Full quantum tomography of high-dimensional quantum systems is experimentally infeasible due to the exponential scaling of the number of required measurements with the number of qubits in the system. However, several ideas were proposed recently for predicting a limited number of features of these states, or estimating the expectation values of operators, without the need for full state reconstruction. These ideas go under the general name of shadow tomography. Here we provide an experimental demonstration of property estimation based on classical shadows proposed in [H.-Y. Huang, R. Kueng, J. Preskill. Nat. Phys. https://doi.org/10.1038/s41567-020-0932-7 (2020)] and study its performance in a quantum optical experiment with high-dimensional spatial states of photons. We show on experimental data how this procedure outperforms conventional state reconstruction in fidelity estimation from a limited number of measurements.
Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics
We introduce a benchmark to evaluate the capability of AI to solve problems in theoretical physics, focusing on high-energy theory and cosmology. The first iteration of our benchmark consists of 57 problems of varying difficulty, from undergraduate to research level. These problems are novel in the sense that they do not come from public problem collections. We evaluate our data set on various open and closed language models, including o3-mini, o1, DeepSeek-R1, GPT-4o and versions of Llama and Qwen. While we find impressive progress in model performance with the most recent models, our research-level difficulty problems are mostly unsolved. We address challenges of auto-verifiability and grading, and discuss common failure modes. While current state-of-the-art models are still of limited use for researchers, our results show that AI-assisted theoretical physics research may become possible in the near future. We discuss the main obstacles towards this goal and possible strategies to overcome them. The public problems and solutions, results for various models, and updates to the data set and score distribution, are available on the website of the dataset tpbench.org.
Neural Networks for cosmological model selection and feature importance using Cosmic Microwave Background data
The measurements of the temperature and polarisation anisotropies of the Cosmic Microwave Background (CMB) by the ESA Planck mission have strongly supported the current concordance model of cosmology. However, the latest cosmological data release from the ESA Planck mission still has a powerful potential to test new data science algorithms and inference techniques. In this paper, we use advanced Machine Learning (ML) algorithms, such as Neural Networks (NNs), to discern among different underlying cosmological models at the angular power spectra level, using both temperature and polarisation Planck 18 data. We test two different models beyond $\Lambda$CDM: a modified gravity model (the Hu-Sawicki model) and an alternative inflationary model (a feature template in the primordial power spectrum). Furthermore, we also implemented an interpretability method based on SHAP values to evaluate the learning process and identify the most relevant elements that drive our architecture to certain outcomes. We find that our NN is able to distinguish between different angular power spectra successfully for both alternative models and $\Lambda$CDM. We conclude by explaining how archival scientific data still has strong potential to test novel data science algorithms that are interesting for the next generation of cosmological experiments.
Statistical selection of high-redshift, neutral-hydrogen-rich, lensed galaxies with the Square Kilometre Array
Deep wide spectral line surveys with the Square Kilometre Array (SKA) will expand the cosmic frontiers of neutral atomic hydrogen (HI) in galaxies. However, at cosmologically significant redshifts ($z \gtrsim 0.5$), detections will typically be spatially unresolved and limited to the highest mass systems. Gravitational lensing could potentially alleviate these limitations, enabling lower mass systems to be studied at higher redshift and spatially resolved dynamical studies of some HI discs. Additionally, lensed HI systems would select foreground dark matter haloes using a different, more extended baryonic tracer compared to other lens surveys. This may result in a wider selected range of foreground dark matter halo properties, such as the concentration parameter. This paper uses the distortion of the observed HI mass function (HIMF) produced by strong gravitational lensing to find a flux density criterion for selecting lensed HI sources in future SKA-Mid spectral line surveys. This selection approach could yield lensed HI source densities in the range of $\sim$0.1--10 galaxies per square degree out to a redshift of $z \simeq 3$ covered by SKA-MID Band 1. Although the sample sizes are modest, even with the proposed SKA-Mid surveys, the selection approach is straightforward and should have a 50% efficiency without any additional information, such as low-impact-factor or lower-redshift massive galaxies. The efficiency of selecting high-redshift, neutral-hydrogen-rich, lensed galaxies should then be greatly enhanced by using SKA-MID data in concert with the Vera C. Rubin Large Survey of Space and Time.
Evidence of Nonlinear Signatures in Solar Wind Proton Density at the L1 Lagrange point
The solar wind is a medium characterized by strong turbulence and significant field fluctuations on various scales. Recent observations have revealed that magnetic turbulence exhibits a self-similar behavior. Similarly, high-resolution measurements of the proton density have shown comparable characteristics, prompting several studies into the multifractal properties of these density fluctuations. In this work, we show that low-resolution observations of the solar wind proton density over time, recorded by various spacecraft at Lagrange point L1, also exhibit non-linear and multifractal structures. The novelty of our study lies in the fact that this is the first systematic analysis of solar wind proton density using low-resolution (hourly) data collected by multiple spacecraft at the L1 Lagrange point over a span of 17 years. Furthermore, we interpret our results within the framework of non-extensive statistical mechanics, which appears to be consistent with the observed nonlinear behavior. Based on the data, we successfully validate the q-triplet predicted by non-extensive statistical theory. To the best of our knowledge, this represents the most rigorous and systematic validation to date of the q-triplet in the solar wind.
Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Shojaee et al. (2025) report that Large Reasoning Models (LRMs) exhibit "accuracy collapse" on planning puzzles beyond certain complexity thresholds. We demonstrate that their findings primarily reflect experimental design limitations rather than fundamental reasoning failures. Our analysis reveals three critical issues: (1) Tower of Hanoi experiments systematically exceed model output token limits at reported failure points, with models explicitly acknowledging these constraints in their outputs; (2) The authors' automated evaluation framework fails to distinguish between reasoning failures and practical constraints, leading to misclassification of model capabilities; (3) Most concerningly, their River Crossing benchmarks include mathematically impossible instances for N > 5 due to insufficient boat capacity, yet models are scored as failures for not solving these unsolvable problems. When we control for these experimental artifacts, by requesting generating functions instead of exhaustive move lists, preliminary experiments across multiple models indicate high accuracy on Tower of Hanoi instances previously reported as complete failures. These findings highlight the importance of careful experimental design when evaluating AI reasoning capabilities.
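The point about output token limits in this abstract can be made concrete with a standard fact: the optimal Tower of Hanoi solution for N disks has 2^N - 1 moves, so an exhaustive move list grows exponentially with N while a function that generates the moves stays the same size. The sketch below is an illustration only; the function names and peg labels are hypothetical and are not taken from the comment or from the original study.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Yield the optimal move sequence for n disks as (disk, from_peg, to_peg)."""
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way, move disk n, then restack.
    yield from hanoi_moves(n - 1, source, spare, target)
    yield (n, source, target)
    yield from hanoi_moves(n - 1, spare, target, source)

def solves_puzzle(n, moves):
    """Simulate the moves and check that they legally move all disks from A to C."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # the named disk is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # cannot place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))

# The move count doubles (plus one) with every added disk.
move_counts = {n: sum(1 for _ in hanoi_moves(n)) for n in (3, 10, 15)}
```

A few lines of code therefore describe a solution whose explicit listing for 15 disks already runs to tens of thousands of moves.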
A Review of NEST Models for Liquid Xenon and Exhaustive Comparison to Other Approaches
This paper will discuss the microphysical simulation of interactions in liquid xenon, the active detector medium in many leading rare-event searches for new physics, and describe experimental observables useful for understanding detector performance. The scintillation and ionization yield distributions for signal and background will be presented using the Noble Element Simulation Technique (NEST), which is a toolkit based on experimental data and simple, empirical formulae, which mimic previous microphysics modeling, but are guided by data. The NEST models for light and charge production as a function of the particle type, energy, and electric field will be reviewed, as well as models for energy resolution and final pulse areas. NEST will be compared to other models or sets of models, and vetted against real data, with several specific examples pulled from XENON, ZEPLIN, LUX, LZ, PandaX, and table-top experiments used for calibrations.
Bounds on geometric wakefields in collimators and step transitions of arbitrary cross sections
We present the wakefield conformal mapping technique that can be readily applied to the analysis of the radiation generated by an ultra-relativistic particle in the step transition and a collimator. We derive simple analytical expressions for the lower and upper bounds of both longitudinal and transverse wake potentials. We test the derived expressions against well-known formulas in several representative examples. The proposed method can greatly simplify the optimization of collimating sections, as well as become a useful tool in the shape optimization problems.
MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search
Large language models (LLMs) have shown promise in automating scientific hypothesis generation, yet existing approaches primarily yield coarse-grained hypotheses lacking critical methodological and experimental details. We introduce and formally define the novel task of fine-grained scientific hypothesis discovery, which entails generating detailed, experimentally actionable hypotheses from coarse initial research directions. We frame this as a combinatorial optimization problem and investigate the upper limits of LLMs' capacity to solve it when maximally leveraged. Specifically, we explore four foundational questions: (1) how to best harness an LLM's internal heuristics to formulate the fine-grained hypothesis it itself would judge as the most promising among all the possible hypotheses it might generate, based on its own internal scoring, thus defining a latent reward landscape over the hypothesis space; (2) whether such LLM-judged better hypotheses exhibit stronger alignment with ground-truth hypotheses; (3) whether shaping the reward landscape using an ensemble of diverse LLMs of similar capacity yields better outcomes than defining it with repeated instances of the strongest LLM among them; and (4) whether an ensemble of identical LLMs provides a more reliable reward landscape than a single LLM. To address these questions, we propose a hierarchical search method that incrementally proposes and integrates details into the hypothesis, progressing from general concepts to specific experimental configurations. We show that this hierarchical process smooths the reward landscape and enables more effective optimization. Empirical evaluations on a new benchmark of expert-annotated fine-grained hypotheses from recent chemistry literature show that our method consistently outperforms strong baselines.
Probing neural language models for understanding of words of estimative probability
Words of estimative probability (WEP) are expressions of a statement's plausibility (probably, maybe, likely, doubt, unlikely, impossible...). Multiple surveys demonstrate the agreement of human evaluators when assigning numerical probability levels to WEP. For example, "highly likely" corresponds to a median chance of 0.90±0.08 in Fagen-Ulmschneider (2015)'s survey. In this work, we measure the ability of neural language processing models to capture the consensual probability level associated with each WEP. Firstly, we use the UNLI dataset (Chen et al., 2020), which associates premises and hypotheses with their perceived joint probability p, to construct prompts, e.g. "[PREMISE]. [WEP], [HYPOTHESIS]." and assess whether language models can predict whether the WEP consensual probability level is close to p. Secondly, we construct a dataset of WEP-based probabilistic reasoning to test whether language models can reason with WEP compositions. When prompted "[EVENTA] is likely. [EVENTB] is impossible.", a causal language model should not express that [EVENTA&B] is likely. We show that both tasks are unsolved by off-the-shelf English language models, but that fine-tuning leads to transferable improvement.
From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems
Research is a fundamental process driving the advancement of human civilization, yet it demands substantial time and effort from researchers. In recent years, the rapid development of artificial intelligence (AI) technologies has inspired researchers to explore how AI can accelerate and enhance research. To monitor relevant advancements, this paper presents a systematic review of the progress in this domain. Specifically, we organize the relevant studies into three main categories: hypothesis formulation, hypothesis validation, and manuscript publication. Hypothesis formulation involves knowledge synthesis and hypothesis generation. Hypothesis validation includes the verification of scientific claims, theorem proving, and experiment validation. Manuscript publication encompasses manuscript writing and the peer review process. Furthermore, we identify and discuss the current challenges faced in these areas, as well as potential future directions for research. Finally, we also offer a comprehensive overview of existing benchmarks and tools across various domains that support the integration of AI into the research process. We hope this paper serves as an introduction for beginners and fosters future research. Resources have been made publicly available at https://github.com/zkzhou126/AI-for-Research.
Theoretical Antineutrino Detection, Direction and Ranging at Long Distances
In this paper we introduce the concept of what we call "NUDAR" (NeUtrino Direction and Ranging), making the point that measurements of the observed energy and direction vectors can be employed to passively deduce the exact three-dimensional location and thermal power of geophysical and anthropogenic neutrino sources from even a single detector. We present the most precise background estimates to date, all handled in full three dimensions, as functions of depth and geographical location. For the present calculations, we consider a hypothetical 138 kiloton detector which can be transported to an ocean site and deployed to an operational depth. We present a Bayesian estimation framework to incorporate any a priori knowledge of the reactor that we are trying to detect, as well as the estimated uncertainty in the background and the oscillation parameters. Most importantly, we fully employ the knowledge of the reactor spectrum and the distance-dependent effects of neutrino oscillations on such spectra. The latter, in particular, makes possible determination of range from one location, given adequate signal statistics. Further, we explore the rich potential of improving detection with even modest improvements in individual neutrino direction determination. We conclude that a 300 MWth reactor can indeed be geolocated, and its operating power estimated with one or two detectors in the hundred kiloton class at ranges out to a few hundred kilometers. We note that such detectors would have natural and non-interfering utility for scientific studies of geo-neutrinos, neutrino oscillations, and astrophysical neutrinos. This motivates the development of cost effective methods of constructing and deploying such next generation detectors.
A new type of Neutrino Detector for Sterile Neutrino Search at Nuclear Reactors and Nuclear Nonproliferation Applications
We describe a new detector, called NuLat, to study electron anti-neutrinos a few meters from a nuclear reactor, and search for anomalous neutrino oscillations. Such oscillations could be caused by sterile neutrinos, and might explain the "Reactor Antineutrino Anomaly". NuLat is made possible by a natural synergy between the miniTimeCube and mini-LENS programs described in this paper. It features a "Raghavan Optical Lattice" (ROL) consisting of 3375 boron or $^6$Li loaded plastic scintillator cubical cells 6.3 cm (2.500") on a side. Cell boundaries have a 0.127 mm (0.005") air gap, resulting in total internal reflection guiding most of the light down the 3 cardinal directions. The ROL detector technology for NuLat gives excellent spatial and energy resolution and allows for in-depth event topology studies. These features allow us to discern inverse beta decay (IBD) signals and the putative oscillation pattern, even in the presence of other backgrounds. We discuss here test venues, efficiency, sensitivity and project status.
Lake- and Surface-Based Detectors for Forward Neutrino Physics
We propose two medium-baseline, kiloton-scale neutrino experiments to study neutrinos from LHC proton-proton collisions: SINE, a surface-based scintillator panel detector observing muon neutrinos from the CMS interaction point, and UNDINE, a water Cherenkov detector submerged in Lake Geneva observing all-flavor neutrinos from LHCb. Using a Monte Carlo simulation, we estimate millions of neutrino interactions during the high-luminosity LHC era. We show that these datasets can constrain neutrino cross sections, charm production in pp collisions, and strangeness enhancement as a solution to the cosmic-ray muon puzzle. SINE and UNDINE thus offer a cost-effective medium-baseline complement to the proposed short-baseline forward physics facility.
Digital Discovery of interferometric Gravitational Wave Detectors
Gravitational waves, detected a century after they were first theorized, are spacetime distortions caused by some of the most cataclysmic events in the universe, including black hole mergers and supernovae. The successful detection of these waves has been made possible by ingenious detectors designed by human experts. Beyond these successful designs, the vast space of experimental configurations remains largely unexplored, offering an exciting territory potentially rich in innovative and unconventional detection strategies. Here, we demonstrate the application of artificial intelligence (AI) to systematically explore this enormous space, revealing novel topologies for gravitational wave (GW) detectors that outperform current next-generation designs under realistic experimental constraints. Our results span a broad range of astrophysical targets, such as black hole and neutron star mergers, supernovae, and primordial GW sources. Moreover, we are able to conceptualize the initially unorthodox discovered designs, emphasizing the potential of using AI algorithms not only in discovering but also in understanding these novel topologies. We have assembled more than 50 superior solutions in a publicly available Gravitational Wave Detector Zoo, which could lead to many new surprising techniques. More broadly, our approach is not limited to gravitational wave detectors and can be extended to AI-driven design of experiments across diverse domains of fundamental physics.
Disintegration and Bayesian Inversion via String Diagrams
The notions of disintegration and Bayesian inversion are fundamental in conditional probability theory. They produce channels, as conditional probabilities, from a joint state, or from an already given channel (in opposite direction). These notions exist in the literature, in concrete situations, but are presented here in abstract graphical formulations. The resulting abstract descriptions are used for proving basic results in conditional probability theory. The existence of disintegration and Bayesian inversion is discussed for discrete probability, and also for measure-theoretic probability, via standard Borel spaces and via likelihoods. Finally, the usefulness of disintegration and Bayesian inversion is illustrated in several examples.
Can LLMs Understand Time Series Anomalies?
Large Language Models (LLMs) have gained popularity in time series forecasting, but their potential for anomaly detection remains largely unexplored. Our study investigates whether LLMs can understand and detect anomalies in time series data, focusing on zero-shot and few-shot scenarios. Inspired by conjectures about LLMs' behavior from time series forecasting research, we formulate key hypotheses about LLMs' capabilities in time series anomaly detection. We design and conduct principled experiments to test each of these hypotheses. Our investigation reveals several surprising findings about LLMs for time series: (1) LLMs understand time series better as images than as text; (2) LLMs did not demonstrate enhanced performance when prompted to engage in explicit reasoning about time series analysis; (3) contrary to common beliefs, LLMs' understanding of time series does not stem from their repetition biases or arithmetic abilities; (4) LLMs' behaviors and performance in time series analysis vary significantly across different model architectures. This study provides the first comprehensive analysis of contemporary LLM capabilities in time series anomaly detection. Our results suggest that while LLMs can understand time series anomalies, many common conjectures based on their reasoning capabilities do not hold. Our code and data are available at https://github.com/Rose-STL-Lab/AnomLLM/.
Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve
The widespread adoption of large language models (LLMs) makes it important to recognize their strengths and limitations. We argue that in order to develop a holistic understanding of these systems we need to consider the problem that they were trained to solve: next-word prediction over Internet text. By recognizing the pressures that this task exerts we can make predictions about the strategies that LLMs will adopt, allowing us to reason about when they will succeed or fail. This approach - which we call the teleological approach - leads us to identify three factors that we hypothesize will influence LLM accuracy: the probability of the task to be performed, the probability of the target output, and the probability of the provided input. We predict that LLMs will achieve higher accuracy when these probabilities are high than when they are low - even in deterministic settings where probability should not matter. To test our predictions, we evaluate two LLMs (GPT-3.5 and GPT-4) on eleven tasks, and we find robust evidence that LLMs are influenced by probability in the ways that we have hypothesized. In many cases, the experiments reveal surprising failure modes. For instance, GPT-4's accuracy at decoding a simple cipher is 51% when the output is a high-probability word sequence but only 13% when it is low-probability. These results show that AI practitioners should be careful about using LLMs in low-probability situations. More broadly, we conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system - one that has been shaped by its own particular set of pressures.
Understanding Deep Networks via Extremal Perturbations and Smooth Masks
The problem of attribution is concerned with identifying the parts of an input that are responsible for a model's output. An important family of attribution methods is based on measuring the effect of perturbations applied to the input. In this paper, we discuss some of the shortcomings of existing approaches to perturbation analysis and address them by introducing the concept of extremal perturbations, which are theoretically grounded and interpretable. We also introduce a number of technical innovations to compute extremal perturbations, including a new area constraint and a parametric family of smooth perturbations, which allow us to remove all tunable hyper-parameters from the optimization problem. We analyze the effect of perturbations as a function of their area, demonstrating excellent sensitivity to the spatial properties of the deep neural network under stimulation. We also extend perturbation analysis to the intermediate layers of a network. This application allows us to identify the salient channels necessary for classification, which, when visualized using feature inversion, can be used to elucidate model behavior. Lastly, we introduce TorchRay, an interpretability library built on PyTorch.
What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models
Foundation models are premised on the idea that sequence prediction can uncover deeper domain understanding, much like how Kepler's predictions of planetary motion later led to the discovery of Newtonian mechanics. However, evaluating whether these models truly capture deeper structure remains a challenge. We develop a technique for evaluating foundation models that examines how they adapt to synthetic datasets generated from some postulated world model. Our technique measures whether the foundation model's inductive bias aligns with the world model, and so we refer to it as an inductive bias probe. Across multiple domains, we find that foundation models can excel at their training tasks yet fail to develop inductive biases towards the underlying world model when adapted to new tasks. We particularly find that foundation models trained on orbital trajectories consistently fail to apply Newtonian mechanics when adapted to new physics tasks. Further analysis reveals that these models behave as if they develop task-specific heuristics that fail to generalize.
Light Schrödinger Bridge
Despite the recent advances in the field of computational Schrödinger Bridges (SB), most existing SB solvers are still heavyweight and require complex optimization of several neural networks. It turns out that there is no principal solver which plays the role of a simple-yet-effective baseline for SB just like, e.g., the k-means method in clustering, logistic regression in classification or the Sinkhorn algorithm in discrete optimal transport. We address this issue and propose a novel fast and simple SB solver. Our development is a smart combination of two ideas which recently appeared in the field: (a) parameterization of the Schrödinger potentials with sum-exp quadratic functions and (b) viewing the log-Schrödinger potentials as the energy functions. We show that, combined together, these ideas yield a lightweight, simulation-free and theoretically justified SB solver with a simple, straightforward optimization objective. As a result, it allows solving SB in moderate dimensions in a matter of minutes on CPU without a painful hyperparameter selection. Our light solver resembles the Gaussian mixture model, which is widely used for density estimation. Inspired by this similarity, we also prove an important theoretical result showing that our light solver is a universal approximator of SBs. Furthermore, we conduct the analysis of the generalization error of our light solver. The code for our solver can be found at https://github.com/ngushchin/LightSB
Advancing AI-Scientist Understanding: Making LLM Think Like a Physicist with Interpretable Reasoning
Large Language Models (LLMs) are playing an expanding role in physics research by enhancing reasoning, symbolic manipulation, and numerical computation. However, ensuring the reliability and interpretability of their outputs remains a significant challenge. In our framework, we conceptualize the collaboration between AI and human scientists as a dynamic interplay among three modules: the reasoning module, the interpretation module, and the AI-scientist interaction module. Recognizing that effective physics reasoning demands rigorous logical consistency, quantitative precision, and deep integration with established theoretical models, we introduce the interpretation module to improve the understanding of AI-generated outputs, which has not previously been explored in the literature. This module comprises multiple specialized agents, including summarizers, model builders, UI builders, and testers, which collaboratively structure LLM outputs within a physically grounded framework by constructing a more interpretable science model. A case study demonstrates that our approach enhances transparency, facilitates validation, and strengthens AI-augmented reasoning in scientific discovery.
HALoGEN: Fantastic LLM Hallucinations and Where to Find Them
Despite their impressive ability to generate high-quality and fluent text, generative large language models (LLMs) also produce hallucinations: statements that are misaligned with established world knowledge or provided input context. However, measuring hallucination can be challenging, as having humans verify model generations on-the-fly is both expensive and time-consuming. In this work, we release HALoGEN, a comprehensive hallucination benchmark consisting of: (1) 10,923 prompts for generative models spanning nine domains including programming, scientific attribution, and summarization, and (2) automatic high-precision verifiers for each use case that decompose LLM generations into atomic units, and verify each unit against a high-quality knowledge source. We use this framework to evaluate ~150,000 generations from 14 language models, finding that even the best-performing models are riddled with hallucinations (sometimes up to 86% of generated atomic facts depending on the domain). We further define a novel error classification for LLM hallucinations based on whether they likely stem from incorrect recollection of training data (Type A errors), or incorrect knowledge in training data (Type B errors), or are fabrication (Type C errors). We hope our framework provides a foundation to enable the principled study of why generative models hallucinate, and advances the development of trustworthy large language models.
The Test of Tests: A Framework For Differentially Private Hypothesis Testing
We present a generic framework for creating differentially private versions of any hypothesis test in a black-box way. We analyze the resulting tests analytically and experimentally. Most crucially, we show good practical performance for small data sets, showing that at epsilon = 1 we only need 5-6 times as much data as in the fully public setting. We compare our work to the one existing framework of this type, as well as to several individually-designed private hypothesis tests. Our framework is higher power than other generic solutions and at least competitive with (and often better than) individually-designed tests.
Hypothesis Search: Inductive Reasoning with Language Models
Inductive reasoning is a core problem-solving capacity: humans can identify underlying principles from a few examples, which can then be robustly generalized to novel scenarios. Recent work has evaluated large language models (LLMs) on inductive reasoning tasks by directly prompting them yielding "in context learning." This can work well for straightforward inductive tasks, but performs very poorly on more complex tasks such as the Abstraction and Reasoning Corpus (ARC). In this work, we propose to improve the inductive reasoning ability of LLMs by generating explicit hypotheses at multiple levels of abstraction: we prompt the LLM to propose multiple abstract hypotheses about the problem, in natural language, then implement the natural language hypotheses as concrete Python programs. These programs can be directly verified by running on the observed examples and generalized to novel inputs. Because of the prohibitive cost of generation with state-of-the-art LLMs, we consider a middle step to filter the set of hypotheses that will be implemented into programs: we either ask the LLM to summarize into a smaller set of hypotheses, or ask human annotators to select a subset of the hypotheses. We verify our pipeline's effectiveness on the ARC visual inductive reasoning benchmark, its variant 1D-ARC, and string transformation dataset SyGuS. On a random 40-problem subset of ARC, our automated pipeline using LLM summaries achieves 27.5% accuracy, significantly outperforming the direct prompting baseline (accuracy of 12.5%). With the minimal human input of selecting from LLM-generated candidates, the performance is boosted to 37.5%. (And we argue this is a lower bound on the performance of our approach without filtering.) Our ablation studies show that abstract hypothesis generation and concrete program representations are both beneficial for LLMs to perform inductive reasoning tasks.
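The verification step described in this abstract, running each candidate program on the observed examples and keeping only those that reproduce every output, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the helper names (`passes_all`, `filter_hypotheses`), the three candidate functions, and the toy string task are all hypothetical stand-ins for LLM-generated programs.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]        # (input, expected output)
Program = Callable[[str], str]   # a candidate hypothesis implemented as code


def passes_all(program: Program, examples: List[Example]) -> bool:
    """Return True if the program reproduces every observed output."""
    for inp, expected in examples:
        try:
            if program(inp) != expected:
                return False
        except Exception:
            # A candidate that crashes counts as a failed hypothesis.
            return False
    return True


def filter_hypotheses(programs: List[Program], examples: List[Example]) -> List[Program]:
    """Keep only the candidates consistent with all observed examples."""
    return [p for p in programs if passes_all(p, examples)]


# Hypothetical candidates standing in for LLM-generated programs.
def reverse_string(s: str) -> str:
    return s[::-1]


def uppercase(s: str) -> str:
    return s.upper()


def reverse_and_uppercase(s: str) -> str:
    return s[::-1].upper()


# Toy observed examples (an assumption for illustration only).
observed = [("abc", "CBA"), ("xy", "YX")]
survivors = filter_hypotheses(
    [reverse_string, uppercase, reverse_and_uppercase], observed
)
```

Only `reverse_and_uppercase` is consistent with both observed pairs, so it is the one candidate that would be applied to novel inputs in this toy setting.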
To Believe or Not to Believe Your LLM
We explore uncertainty quantification in large language models (LLMs), with the goal of identifying when uncertainty in responses given a query is large. We simultaneously consider both epistemic and aleatoric uncertainties, where the former comes from the lack of knowledge about the ground truth (such as about facts or the language), and the latter comes from irreducible randomness (such as multiple possible answers). In particular, we derive an information-theoretic metric that allows one to reliably detect when only epistemic uncertainty is large, in which case the output of the model is unreliable. This condition can be computed based solely on the output of the model, obtained simply by some special iterative prompting based on the previous responses. Such quantification, for instance, makes it possible to detect hallucinations (cases when epistemic uncertainty is high) in both single- and multi-answer responses. This is in contrast to many standard uncertainty quantification strategies (such as thresholding the log-likelihood of a response) where hallucinations in the multi-answer case cannot be detected. We conduct a series of experiments which demonstrate the advantage of our formulation. Further, our investigations shed some light on how the probabilities assigned to a given output by an LLM can be amplified by iterative prompting, which might be of independent interest.
LLMs Will Always Hallucinate, and We Need to Live With This
As Large Language Models become more ubiquitous across domains, it becomes important to examine their inherent limitations critically. This work argues that hallucinations in language models are not just occasional errors but an inevitable feature of these systems. We demonstrate that hallucinations stem from the fundamental mathematical and logical structure of LLMs. It is, therefore, impossible to eliminate them through architectural improvements, dataset enhancements, or fact-checking mechanisms. Our analysis draws on computational theory and Gödel's First Incompleteness Theorem, which references the undecidability of problems like the Halting, Emptiness, and Acceptance Problems. We demonstrate that every stage of the LLM process, from training data compilation to fact retrieval, intent classification, and text generation, will have a non-zero probability of producing hallucinations. This work introduces the concept of Structural Hallucination as an intrinsic nature of these systems. By establishing the mathematical certainty of hallucinations, we challenge the prevailing notion that they can be fully mitigated.
Do Language Models Know When They're Hallucinating References?
State-of-the-art language models (LMs) are notoriously susceptible to generating hallucinated information. Such inaccurate outputs not only undermine the reliability of these models but also limit their use and raise serious concerns about misinformation and propaganda. In this work, we focus on hallucinated book and article references and present them as the "model organism" of language model hallucination research, due to their frequent and easy-to-discern nature. We posit that if a language model cites a particular reference in its output, then it should ideally possess sufficient information about its authors and content, among other relevant details. Using this basic insight, we illustrate that one can identify hallucinated references without ever consulting any external resources, by asking a set of direct or indirect queries to the language model about the references. These queries can be considered as "consistency checks." Our findings highlight that while LMs, including GPT-4, often produce inconsistent author lists for hallucinated references, they also often accurately recall the authors of real references. In this sense, the LM can be said to "know" when it is hallucinating references. Furthermore, these findings show how hallucinated references can be dissected to shed light on their nature. Replication code and results can be found at https://github.com/microsoft/hallucinated-references.
Optimizing the L-σ Relation of HII Galaxies for Improving Cosmological Application
The basic premise of using HII starburst galaxies (HIIGs) as cosmic "standard candles" is that there is a significant correlation between the Hβ luminosity (L) and the velocity dispersion (σ) of the ionized gas from HIIG measurements, which can be called the empirical L-σ relation. However, the L-σ scaling relation that is well calibrated with the lower-redshift HIIGs does not fit the higher-redshift HIIGs. To solve this problem, we explore a new relational expression for the L-σ relation that should be suitable for both lower-redshift and higher-redshift HIIGs. After reconstructing the Hubble diagram with the Gaussian process (GP) method from the Pantheon+ supernovae Ia sample, we examine and compare six different revised formulas of the L-σ relation. Furthermore, we use the Bayesian evidence to compare the revised L-σ relations with the analysis of a joint sample of 36 giant extragalactic HII regions (GEHRs) and 145 HIIGs. It turns out that the redshift-dependent bilinear correction and the quadratic σ-based correction are significantly better than the others. Moreover, the quadratic σ-based correction is the most supported one. This suggests that appropriate corrections to the L-σ relation should be considered when HIIGs are used as cosmological probes.
A category theory framework for Bayesian learning
Inspired by the foundational works by Spivak and Fong and Cruttwell et al., we introduce a categorical framework to formalize Bayesian inference and learning. The two key ideas at play here are the notions of Bayesian inversions and the functor GL as constructed by Cruttwell et al. In this context, we find that Bayesian learning is the simplest case of the learning paradigm. We then obtain categorical formulations of batch and sequential Bayes updates while also verifying that the two coincide in a specific example.
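The claim that batch and sequential Bayes updates coincide can be illustrated numerically in an elementary setting. The sketch below is not the paper's categorical construction or its specific example: it assumes a hypothetical coin with three candidate biases, a uniform prior, and independent flips, and uses exact rational arithmetic so that the two posteriors can be compared for equality.

```python
from fractions import Fraction
from typing import Dict, List

Belief = Dict[Fraction, Fraction]  # maps a candidate bias theta to its probability


def likelihood(theta: Fraction, x: int) -> Fraction:
    """Probability of observing flip x (1 = heads, 0 = tails) under bias theta."""
    return theta if x == 1 else 1 - theta


def normalize(weights: Belief) -> Belief:
    """Rescale non-negative weights so that they sum to one."""
    total = sum(weights.values())
    return {theta: w / total for theta, w in weights.items()}


def batch_update(prior: Belief, data: List[int]) -> Belief:
    """Condition on all observations at once via the joint likelihood."""
    weights = {}
    for theta, p in prior.items():
        w = p
        for x in data:
            w *= likelihood(theta, x)
        weights[theta] = w
    return normalize(weights)


def sequential_update(prior: Belief, data: List[int]) -> Belief:
    """Condition on one observation at a time, reusing each posterior as the next prior."""
    belief = dict(prior)
    for x in data:
        belief = normalize(
            {theta: p * likelihood(theta, x) for theta, p in belief.items()}
        )
    return belief


# Hypothetical setup: three candidate biases with a uniform prior.
thetas = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
prior = {theta: Fraction(1, 3) for theta in thetas}
data = [1, 1, 0, 1]  # three heads and one tail

batch = batch_update(prior, data)
sequential = sequential_update(prior, data)
```

Under these assumptions `batch == sequential` holds exactly, because the intermediate normalizations in the sequential version only rescale the same product of likelihoods.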
Why is AI hard and Physics simple?
We discuss why AI is hard and why physics is simple. We discuss how physical intuition and the approach of theoretical physics can be brought to bear on the field of artificial intelligence and specifically machine learning. We suggest that the underlying project of machine learning and the underlying project of physics are strongly coupled through the principle of sparsity, and we call upon theoretical physicists to work on AI as physicists. As a first step in that direction, we discuss an upcoming book on the principles of deep learning theory that attempts to realize this approach.
Spectral properties of bottomonium at high temperature: a systematic investigation
We investigate spectral features of bottomonium at high temperature, in particular the thermal mass shift and width of ground state S-wave and P-wave state. We employ and compare a range of methods for determining these features from lattice NRQCD correlators, including direct correlator analyses (multi-exponential fits and moments of spectral functions), linear methods (Backus-Gilbert, Tikhonov and HLT methods), and Bayesian methods for spectral function reconstruction (MEM and BR). We comment on the reliability and limitations of the various methods.
Response: Emergent analogical reasoning in large language models
In their recent Nature Human Behaviour paper, "Emergent analogical reasoning in large language models," (Webb, Holyoak, and Lu, 2023) the authors argue that "large language models such as GPT-3 have acquired an emergent ability to find zero-shot solutions to a broad range of analogy problems." In this response, we provide counterexamples of the letter string analogies. In our tests, GPT-3 fails to solve even the easiest variants of the problems presented in the original paper. Zero-shot reasoning is an extraordinary claim that requires extraordinary evidence. We do not see that evidence in our experiments. To strengthen claims of humanlike reasoning such as zero-shot reasoning, it is important that the field develop approaches that rule out data memorization.
Modeling the Machine Learning Multiverse
Amid mounting concern about the reliability and credibility of machine learning research, we present a principled framework for making robust and generalizable claims: the multiverse analysis. Our framework builds upon the multiverse analysis (Steegen et al., 2016) introduced in response to psychology's own reproducibility crisis. To efficiently explore high-dimensional and often continuous ML search spaces, we model the multiverse with a Gaussian Process surrogate and apply Bayesian experimental design. Our framework is designed to facilitate drawing robust scientific conclusions about model performance, and thus our approach focuses on exploration rather than conventional optimization. In the first of two case studies, we investigate disputed claims about the relative merit of adaptive optimizers. Second, we synthesize conflicting research on the effect of learning rate on the large batch training generalization gap. For the machine learning community, the multiverse analysis is a simple and effective technique for identifying robust claims, for increasing transparency, and a step toward improved reproducibility.
Cosmic Calipers: Precise and Accurate Neutron Star Radius Measurements with Next-Generation Gravitational Wave Detectors
Gravitational waves from merging binary neutron stars carry characteristic information about their astrophysical properties, including masses and tidal deformabilities, that are needed to infer their radii. In this study, we use Bayesian inference to quantify the precision with which the radius can be inferred with upgrades to the current gravitational wave detectors and with next-generation observatories such as the Einstein Telescope and Cosmic Explorer. We assign evidences for a set of plausible equations of state, which are then used as weights to obtain radius posteriors. We find that prior choices and the loudness of observed signals limit the precision and accuracy of radii inferred by current detectors. In contrast, next-generation observatories can resolve the radius precisely and accurately, across most of the mass range, to within $\lesssim 5\%$ for both soft and stiff equations of state. We also explore how the choice of the neutron star mass prior can influence the inferred masses and potentially affect radii measurements, finding that choosing an astrophysically motivated prior does not notably impact an individual neutron star's radius measurements.
Intensity statistics inside an open wave-chaotic cavity with broken time-reversal invariance
Using the supersymmetric method of random matrix theory within the Heidelberg approach framework, we provide a statistical description of stationary intensity sampled in locations inside an open wave-chaotic cavity, assuming that the time-reversal invariance inside the cavity is fully broken. In particular, we show that when incoming waves are fed via a finite number $M$ of open channels, the probability density $\mathcal{P}(I)$ for the single-point intensity $I$ decays as a power law for large intensities: $\mathcal{P}(I)\sim I^{-(M+2)}$, provided there are no internal losses. This behaviour is in marked difference with the Rayleigh law $\mathcal{P}(I)\sim \exp(-I/\overline{I})$, which turns out to be valid only in the limit $M\to\infty$. We also find the joint probability density of intensities $I_1, \ldots, I_L$ in $L>1$ observation points, and then extract the corresponding statistics for the maximal intensity in the observation pattern. For $L\to\infty$ the resulting limiting extreme value statistics (EVS) turns out to be different from the classical EVS distributions.
A Machine Learning Pipeline for Hunting Hidden Axion Signals in Pulsar Dispersion Measurements
In the axion model, electromagnetic waves interacting with axions induce frequency-dependent time delays, determined by the axion mass and decay constant. These small delays are difficult to detect, making traditional methods ineffective. To address this, we computed time delays for various parameters and found a prominent dispersion signal when the wave frequency equals half the axion mass. Based on this, we developed a machine learning-based pipeline, achieving 95% classification accuracy and demonstrating strong detection capability in low signal-to-noise data. Applying this to PSR J1933-6211, we found no axion-induced delays within current sensitivity limits. While existing constraints are limited by atomic clock resolution in radio telescopes, future advances in optical clocks and broader bandwidths will enable more extensive searches. In particular, combining high-precision optical clocks with next-generation radio telescopes, such as the Qitai Radio Telescope, could improve decay constant constraints by four orders of magnitude for axion masses in the $10^{-6} \sim 10^{-4}$ eV range.
Projections of Earth's Technosphere: Luminosity and Mass as Limits to Growth
Earth remains the only known example of a planet with technology, and future projections of Earth's trajectory provide a basis and motivation for approaching the search for extraterrestrial technospheres. Conventional approaches toward projecting Earth's technosphere include applications of the Kardashev scale, which suggest the possibility that energy-intensive civilizations may expand to harness the entire energy output available to their planet, host star, or even the entire galaxy. In this study, we argue that the Kardashev scale is better understood as a "luminosity limit" that describes the maximum capacity for a civilization to harvest luminous stellar energy across a given spatial domain, and we note that thermodynamic efficiency will always keep a luminosity-limited technosphere from actually reaching this theoretical limit. We suggest the possibility that an advanced technosphere might evolve beyond this luminosity limit to draw its energy directly from harvesting stellar mass, and we also discuss possible trajectories that could exist between Earth today and such hypothetical "stellivores." We develop a framework to describe trajectories for long-lived technospheres that optimize their growth strategies between exploration and exploitation, unlike Earth today. We note that analyses of compact accreting stars could provide ways to test the stellivore hypothesis, and we more broadly suggest an expansion of technosignature search strategies beyond those that reside exactly at the luminosity limit.
PhysicsArena: The First Multimodal Physics Reasoning Benchmark Exploring Variable, Process, and Solution Dimensions
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in diverse reasoning tasks, yet their application to complex physics reasoning remains underexplored. Physics reasoning presents unique challenges, requiring grounding in physical conditions and the interpretation of multimodal information. Current physics benchmarks are limited, often focusing on text-only inputs or solely on problem-solving, thereby overlooking the critical intermediate steps of variable identification and process formulation. To address these limitations, we introduce PhysicsArena, the first multimodal physics reasoning benchmark designed to holistically evaluate MLLMs across three critical dimensions: variable identification, physical process formulation, and solution derivation. PhysicsArena aims to provide a comprehensive platform for assessing and advancing the multimodal physics reasoning abilities of MLLMs.
Deep Learning and genetic algorithms for cosmological Bayesian inference speed-up
In this paper, we present a novel approach to accelerate the Bayesian inference process, focusing specifically on the nested sampling algorithms. Bayesian inference plays a crucial role in cosmological parameter estimation, providing a robust framework for extracting theoretical insights from observational data. However, its computational demands can be substantial, primarily due to the need for numerous likelihood function evaluations. Our proposed method utilizes the power of deep learning, employing feedforward neural networks to approximate the likelihood function dynamically during the Bayesian inference process. Unlike traditional approaches, our method trains neural networks on-the-fly using the current set of live points as training data, without the need for pre-training. This flexibility enables adaptation to various theoretical models and datasets. We perform simple hyperparameter optimization using genetic algorithms to suggest initial neural network architectures for learning each likelihood function. Once sufficient accuracy is achieved, the neural network replaces the original likelihood function. The implementation integrates with nested sampling algorithms and has been thoroughly evaluated using both simple cosmological dark energy models and diverse observational datasets. Additionally, we explore the potential of genetic algorithms for generating initial live points within nested sampling inference, opening up new avenues for enhancing the efficiency and effectiveness of Bayesian inference methods.
Phase Transitions in the Detection of Correlated Databases
We study the problem of detecting the correlation between two Gaussian databases $X\in\mathbb{R}^{n\times d}$ and $Y\in\mathbb{R}^{n\times d}$, each composed of $n$ users with $d$ features. This problem is relevant in the analysis of social media, computational biology, etc. We formulate this as a hypothesis testing problem: under the null hypothesis, these two databases are statistically independent. Under the alternative, however, there exists an unknown permutation $\sigma$ over the set of $n$ users (or, row permutation), such that $X$ is $\rho$-correlated with $Y^\sigma$, a permuted version of $Y$. We determine sharp thresholds at which optimal testing exhibits a phase transition, depending on the asymptotic regime of $n$ and $d$. Specifically, we prove that if $\rho^2 d\to 0$ as $d\to\infty$, then weak detection (performing slightly better than random guessing) is statistically impossible, irrespective of the value of $n$. This complements the performance of a simple test that thresholds the sum of all entries of $X^T Y$. Furthermore, when $d$ is fixed, we prove that strong detection (vanishing error probability) is impossible for any $\rho<\rho^\star$, where $\rho^\star$ is an explicit function of $d$, while weak detection is again impossible as long as $\rho^2 d\to 0$. These results close significant gaps in recent related studies.
Calibrated Language Models Must Hallucinate
Recent language models have a mysterious tendency to generate false but plausible-sounding text. Such "hallucinations" are an obstacle to the usability of language-based AI systems and can harm people who rely upon their outputs. This work shows that there is an inherent statistical reason that pretrained language models hallucinate certain types of facts, having nothing to do with the transformer LM architecture or data quality. For "arbitrary" facts whose veracity cannot be determined from the training data, we show that hallucination is necessary for language models that satisfy a statistical calibration condition appropriate for generative language models. Specifically, if the maximum probability of any fact is bounded, we show that the probability of generating a hallucination is close to the fraction of facts that occur exactly once in the training data (a "Good-Turing" estimate), even assuming ideal training data without errors. One conclusion is that models pretrained to be sufficiently good predictors (i.e., calibrated) may require post-training to mitigate hallucinations on the type of arbitrary facts that tend to appear once in the training set. However, our analysis also suggests that there is no statistical reason that pretraining will lead to hallucination on facts that tend to appear more than once in the training data (like references to publications such as articles and books, whose hallucinations have been particularly notable and problematic) or on systematic facts (like arithmetic calculations). Therefore, different architectures and learning algorithms may mitigate these latter types of hallucinations.
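As a rough illustration of the "Good-Turing" quantity mentioned in the abstract above, the following minimal sketch computes the share of a sample taken up by items seen exactly once. The function name `singleton_rate` and the toy data are hypothetical, and dividing by the total number of observations (the classical Good-Turing missing-mass convention) is an assumption; the paper's precise definition may differ.

```python
from collections import Counter


def singleton_rate(observations):
    """Good-Turing style estimate: (# items seen exactly once) / (# observations).

    Assumption: the denominator is the total number of observations,
    as in the classical Good-Turing missing-mass estimator.
    """
    observations = list(observations)
    if not observations:
        raise ValueError("need at least one observation")
    counts = Counter(observations)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(observations)


# Toy "training set" of facts: b and d each appear exactly once.
facts = ["a", "a", "b", "c", "c", "c", "d"]
rate = singleton_rate(facts)  # 2 singletons out of 7 observations
```

Under the abstract's claim, a calibrated model trained on such data would be expected to hallucinate arbitrary facts at a rate close to this value; the sketch only computes the estimate and does not model calibration.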
WALLABY Pilot Survey & ASymba: Comparing HI Detection Asymmetries to the SIMBA Simulation
An avenue for understanding cosmological galaxy formation is to compare morphometric parameters in observations and simulations of galaxy assembly. In this second paper of the ASymba: Asymmetries of HI in SIMBA Galaxies series, we measure atomic gas HI asymmetries in spatially-resolved detections from the untargeted WALLABY survey, and compare them to realizations of WALLABY-like mock samples from the SIMBA cosmological simulations. We develop a Scanline Tracing method to create mock galaxy HI datacubes which minimizes shot noise along the spectral dimension compared to particle-based methods, and therefore spurious asymmetry contributions. We compute 1D and 3D asymmetries for spatially-resolved WALLABY Pilot Survey detections, and find that the highest 3D asymmetries, $A_{3D}>0.5$, stem from interacting systems or detections with strong bridges or tails. We then construct a series of WALLABY-like mock realizations drawn from the SIMBA 50 Mpc simulation volume, and compare their asymmetry distributions. We find that the incidence of high $A_{3D}$ detections is higher in WALLABY than in the SIMBA mocks, but that difference is not statistically significant (p-value = 0.05). The statistical power of quantitative comparisons of asymmetries such as the one presented here will improve as the WALLABY survey progresses, and as simulation volumes and resolutions increase.
Forecasting Thermoacoustic Instabilities in Liquid Propellant Rocket Engines Using Multimodal Bayesian Deep Learning
The 100 MW cryogenic liquid oxygen/hydrogen multi-injector combustor BKD operated by the DLR Institute of Space Propulsion is a research platform that allows the study of thermoacoustic instabilities under realistic conditions, representative of small upper stage rocket engines. We use data from BKD experimental campaigns in which the static chamber pressure and fuel-oxidizer ratio are varied such that the first tangential mode of the combustor is excited under some conditions. We train an autoregressive Bayesian neural network model to forecast the amplitude of the dynamic pressure time series, inputting multiple sensor measurements (injector pressure/temperature measurements, static chamber pressure, high-frequency dynamic pressure measurements, high-frequency OH* chemiluminescence measurements) and future flow rate control signals. The Bayesian nature of our algorithms allows us to work with a dataset whose size is restricted by the expense of each experimental run, without making overconfident extrapolations. We find that the networks are able to accurately forecast the evolution of the pressure amplitude and anticipate instability events on unseen experimental runs 500 milliseconds in advance. We compare the predictive accuracy of multiple models using different combinations of sensor inputs. We find that the high-frequency dynamic pressure signal is particularly informative. We also use the technique of integrated gradients to interpret the influence of different sensor inputs on the model prediction. The negative log-likelihood of data points in the test dataset indicates that predictive uncertainties are well-characterized by our Bayesian model, and simulating a sensor failure event results, as expected, in a dramatic increase in the epistemic component of the uncertainty.
PhyX: Does Your Model Have the "Wits" for Physical Reasoning?
Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave & acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy respectively, with performance gaps exceeding 29% compared to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement a compatible evaluation protocol based on widely-used toolkits such as VLMEvalKit, enabling one-click evaluation.
Applications of Machine Learning to Lattice Quantum Field Theory
There is great potential to apply machine learning in the area of numerical lattice quantum field theory, but full exploitation of that potential will require new strategies. In this white paper for the Snowmass community planning process, we discuss the unique requirements of machine learning for lattice quantum field theory research and outline what is needed to enable exploration and deployment of this approach in the future.
Dynamical Model of J/Ψ photo-production on the nucleon
A dynamical model based on a phenomenological charm quark-nucleon ($c$-$N$) potential $v_{cN}$ and the Pomeron-exchange mechanism is constructed to investigate the $J/\Psi$ photo-production on the nucleon from threshold to invariant mass $W=300$ GeV. The $J/\Psi$-$N$ potential, $V_{J/\Psi N}(r)$, is constructed by folding $v_{cN}$ into the wavefunction $\Phi_{J/\Psi}(c\bar{c})$ of $J/\Psi$ within a Constituent Quark Model (CQM) of Ref. [43]. A photo-production amplitude is also generated by $v_{cN}$ by a $c\bar{c}$-loop integration over the $\gamma\rightarrow c\bar{c}$ vertex function and $\Phi_{J/\Psi}(c\bar{c})$. No commonly used Vector Meson Dominance assumption is used to define this photo-production amplitude, which is needed to describe the data near the threshold. The potential $v_{cN}(r)$ is parameterized in a form such that the predicted $V_{J/\Psi N}(r)$ at large distances has the same Yukawa potential form extracted from a Lattice QCD (LQCD) calculation of Ref. [18]. The parameters of $v_{cN}$ are determined by fitting the total cross section data of JLab by performing calculations that include $J/\Psi$-$N$ final state interactions (FSI). The resulting differential cross sections are found to be in good agreement with the data. It is shown that the FSI effects dominate the cross section in the very near threshold region, allowing for sensitive testing of the predicted $J/\Psi$-$N$ scattering amplitudes. By imposing the constraints of the $J/\Psi$-$N$ potential extracted from the LQCD calculation, we have obtained three $J/\Psi$-$N$ potentials which fit the JLab data equally well. The resulting $J/\Psi$-$N$ scattering lengths are in the range of $a=(-0.05\ \mathrm{fm} \sim -0.25\ \mathrm{fm})$. With the determined $v_{cN}(r)$ and the wavefunctions generated from the same CQM, the constructed model is used to predict the cross sections of photo-production of $\eta_c(1S)$ and $\Psi(2S)$ mesons for future experimental tests.
I Don't Know: Explicit Modeling of Uncertainty with an [IDK] Token
Large Language Models are known to capture real-world knowledge, allowing them to excel in many downstream tasks. Despite recent advances, these models are still prone to what are commonly known as hallucinations, causing them to emit unwanted and factually incorrect text. In this work, we propose a novel calibration method that can be used to combat hallucinations. We add a special [IDK] ("I don't know") token to the model's vocabulary and introduce an objective function that shifts probability mass to the [IDK] token for incorrect predictions. This approach allows the model to express uncertainty in its output explicitly. We evaluate our proposed method across multiple model architectures and factual downstream tasks. We find that models trained with our method are able to express uncertainty in places where they would previously make mistakes while suffering only a small loss of encoded knowledge. We further perform extensive ablation studies of multiple variations of our approach and provide a detailed analysis of the precision-recall tradeoff of our method.
Large Language Models for History, Philosophy, and Sociology of Science: Interpretive Uses, Methodological Challenges, and Critical Perspectives
This paper explores the use of large language models (LLMs) as research tools in the history, philosophy, and sociology of science (HPSS). LLMs are remarkably effective at processing unstructured text and inferring meaning from context, offering new affordances that challenge long-standing divides between computational and interpretive methods. This raises both opportunities and challenges for HPSS, which emphasizes interpretive methodologies and understands meaning as context-dependent, ambiguous, and historically situated. We argue that HPSS is uniquely positioned not only to benefit from LLMs' capabilities but also to interrogate their epistemic assumptions and infrastructural implications. To this end, we first offer a concise primer on LLM architectures and training paradigms tailored to non-technical readers. We frame LLMs not as neutral tools but as epistemic infrastructures that encode assumptions about meaning, context, and similarity, conditioned by their training data, architecture, and patterns of use. We then examine how computational techniques enhanced by LLMs, such as structuring data, detecting patterns, and modeling dynamic processes, can be applied to support interpretive research in HPSS. Our analysis compares full-context and generative models, outlines strategies for domain and task adaptation (e.g., continued pretraining, fine-tuning, and retrieval-augmented generation), and evaluates their respective strengths and limitations for interpretive inquiry in HPSS. We conclude with four lessons for integrating LLMs into HPSS: (1) model selection involves interpretive trade-offs; (2) LLM literacy is foundational; (3) HPSS must define its own benchmarks and corpora; and (4) LLMs should enhance, not replace, interpretive methods.
Distinguishing Ignorance from Error in LLM Hallucinations
Large language models (LLMs) are susceptible to hallucinations: outputs that are ungrounded, factually incorrect, or inconsistent with prior generations. We focus on closed-book Question Answering (CBQA), where previous work has not fully addressed the distinction between two possible kinds of hallucinations, namely, whether the model (1) does not hold the correct answer in its parameters or (2) answers incorrectly despite having the required knowledge. We argue that distinguishing these cases is crucial for detecting and mitigating hallucinations. Specifically, case (2) may be mitigated by intervening in the model's internal computation, as the knowledge resides within the model's parameters. In contrast, in case (1) there is no parametric knowledge to leverage for mitigation, so it should be addressed by resorting to an external knowledge source or abstaining. To help distinguish between the two cases, we introduce Wrong Answer despite having Correct Knowledge (WACK), an approach for constructing model-specific datasets for the second hallucination type. Our probing experiments indicate that the two kinds of hallucinations are represented differently in the model's inner states. Next, we show that datasets constructed using WACK exhibit variations across models, demonstrating that even when models share knowledge of certain facts, they still vary in the specific examples that lead to hallucinations. Finally, we show that training a probe on our WACK datasets leads to better hallucination detection of case (2) hallucinations than using the common generic one-size-fits-all datasets. The code is available at https://github.com/technion-cs-nlp/hallucination-mitigation.
Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis
With the exponential growth of research facilitated by modern technology and improved accessibility, scientific discoveries have become increasingly fragmented within and across fields. This makes it challenging to assess the significance, novelty, incremental findings, and equivalent ideas between related works, particularly those from different research communities. Large language models (LLMs) have recently demonstrated strong quantitative and qualitative reasoning abilities, and multi-agent LLM debates have shown promise in handling complex reasoning tasks by exploring diverse perspectives and reasoning paths. Inspired by this, we introduce Tree-of-Debate (ToD), a framework which converts scientific papers into LLM personas that debate their respective novelties. To emphasize structured, critical reasoning rather than focusing solely on outcomes, ToD dynamically constructs a debate tree, enabling fine-grained analysis of independent novelty arguments within scholarly articles. Through experiments on scientific literature across various domains, evaluated by expert researchers, we demonstrate that ToD generates informative arguments, effectively contrasts papers, and supports researchers in their literature review.
Orbits and Dynamical Masses for Six Binary Systems in the Hyades Cluster
We report long baseline interferometric observations with the CHARA Array that resolve six previously known double-lined spectroscopic binary systems in the Hyades cluster, with orbital periods ranging from 3 to 358 days: HD 27483, HD 283882, HD 26874, HD 27149, HD 30676, and HD 28545. We combine those observations with new and existing radial-velocity measurements to infer the dynamical masses for the components as well as the orbital parallaxes. For most stars the masses are determined to better than 1%. Our work significantly increases the number of systems with mass determinations in the cluster. We find that while current models of stellar evolution for the age and metallicity of the Hyades are able to reproduce the overall shape of the empirical mass-luminosity relation, they overestimate the V-band fluxes by about 0.1 mag between 0.5 and 1.4 $M_{\odot}$. The disagreement is smaller in H, and near zero in K, and depends somewhat on the model. We also make use of the TESS light curves to estimate rotation periods for our targets, and detect numerous flares in one of them (HD 283882), estimating an average flaring rate of 0.44 events per day.
God(s) Know(s): Developmental and Cross-Cultural Patterns in Children Drawings
This paper introduces a novel approach to data analysis designed for the needs of specialists in psychology of religion. We detect developmental and cross-cultural patterns in children's drawings of God(s) and other supernatural agents. We develop methods to objectively evaluate our empirical observations of the drawings with respect to: (1) the gravity center, (2) the average intensities of the colors green and yellow, (3) the use of different colors (palette) and (4) the visual complexity of the drawings. We find statistically significant differences across ages and countries in the gravity centers and in the average intensities of colors. These findings support the hypotheses of the experts and raise new questions for further investigation.
Elliptical orbits in the phase-space quantization
The energy levels of hydrogen-like atoms are obtained from the phase-space quantization, one of the pillars of the old quantum theory, by three different methods: (i) direct integration, (ii) Sommerfeld's original method, and (iii) complex integration. The difficulties come from the imposition of elliptical orbits on the electron, resulting in a variable radial component of the linear momentum. Details of the calculation, which constitute a recurrent gap in textbooks that deal with phase-space quantization, are shown in depth in an accessible fashion for students of introductory quantum mechanics courses.
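For orientation, the calculation summarized in the abstract above can be sketched in standard textbook form. The notation below ($k$, $n_r$, $n_\varphi$, $L$) is assumed for illustration and is not taken from the paper itself.

```latex
% Wilson-Sommerfeld quantization conditions for the radial and angular motion
\oint p_r \, dr = n_r h, \qquad \oint p_\varphi \, d\varphi = n_\varphi h
\;\Rightarrow\; L = n_\varphi \hbar .

% Radial momentum on an elliptical orbit (E < 0), with k = Z e^2 / (4 \pi \varepsilon_0)
p_r = \sqrt{\, 2 m \left( E + \frac{k}{r} \right) - \frac{L^2}{r^2} \,} .

% Evaluating the radial phase integral over one period
\oint p_r \, dr = 2\pi \left( \frac{m k}{\sqrt{-2 m E}} - L \right) .

% Imposing both conditions with n = n_r + n_\varphi gives the energy levels
E_n = - \frac{m k^2}{2 \hbar^2 n^2}
    = - \frac{m Z^2 e^4}{2 (4 \pi \varepsilon_0)^2 \hbar^2 n^2} .
```

The radial integral in the third line is the step that the three methods listed in the abstract evaluate in different ways; the resulting levels depend only on the sum $n = n_r + n_\varphi$.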
A differentiable binary microlensing model using adaptive contour integration method
We present microlux, which is a Jax-based code that can compute the binary microlensing light curve and its derivatives both efficiently and accurately. The key feature of microlux is the implementation of a modified version of the adaptive sampling algorithm that was originally proposed by V. Bozza to account for the finite-source effect most efficiently. The efficiency and accuracy of microlux have been verified across the relevant parameter space for binary microlensing. As a differentiable code, microlux makes it possible to apply gradient-based algorithms to the search and posterior estimation of the microlensing modeling. As an example, we use microlux to model a real microlensing event and infer the model posterior via both Fisher information matrix and Hamiltonian Monte Carlo, neither of which would have been possible without the access to accurate model gradients.
A search for extremely-high-energy neutrinos and first constraints on the ultra-high-energy cosmic-ray proton fraction with IceCube
We present a search for the diffuse extremely-high-energy neutrino flux using 12.6 years of IceCube data. The non-observation of neutrinos with energies well above 10 PeV constrains the all-flavor neutrino flux at $10^{18}$ eV to a level of $E^2 \Phi_{\nu_e + \nu_\mu + \nu_\tau} \simeq 10^{-8}$ GeV cm$^{-2}$ s$^{-1}$ sr$^{-1}$, the most stringent limit to date. Using this data, we constrain the proton fraction of ultra-high-energy cosmic rays (UHECRs) above $\simeq 30$ EeV to be $\lesssim 70\%$ (at 90% CL) if the cosmological evolution of the sources is comparable to or stronger than the star formation rate. This result complements direct air-shower measurements by being insensitive to uncertainties associated with hadronic interaction models. It is the first such result to disfavor the "proton-only" hypothesis for UHECRs using neutrino data.
A Reliable Knowledge Processing Framework for Combustion Science using Foundation Models
This research explores the integration of large language models (LLMs) into scientific data assimilation, focusing on combustion science as a case study. Leveraging foundational models integrated with Retrieval-Augmented Generation (RAG) framework, the study introduces an approach to process diverse combustion research data, spanning experimental studies, simulations, and literature. The multifaceted nature of combustion research emphasizes the critical role of knowledge processing in navigating and extracting valuable information from a vast and diverse pool of sources. The developed approach minimizes computational and economic expenses while optimizing data privacy and accuracy. It incorporates prompt engineering and offline open-source LLMs, offering user autonomy in selecting base models. The study provides a thorough examination of text segmentation strategies, conducts comparative studies between LLMs, and explores various optimized prompts to demonstrate the effectiveness of the framework. By incorporating an external database, the framework outperforms a conventional LLM in generating accurate responses and constructing robust arguments. Additionally, the study delves into the investigation of optimized prompt templates for the purpose of efficient extraction of scientific literature. The research addresses concerns related to hallucinations and false research articles by introducing a custom workflow developed with a detection algorithm to filter out inaccuracies. Despite identified areas for improvement, the framework consistently delivers accurate domain-specific responses with minimal human oversight. The prompt-agnostic approach introduced holds promise for future deliberations. The study underscores the significance of integrating LLMs and knowledge processing techniques in scientific research, providing a foundation for advancements in data assimilation and utilization.
The Compositional Structure of Bayesian Inference
Bayes' rule tells us how to invert a causal process in order to update our beliefs in light of new evidence. If the process is believed to have a complex compositional structure, we may observe that the inversion of the whole can be computed piecewise in terms of the component processes. We study the structure of this compositional rule, noting that it relates to the lens pattern in functional programming. Working in a suitably general axiomatic presentation of a category of Markov kernels, we see how we can think of Bayesian inversion as a particular instance of a state-dependent morphism in a fibred category. We discuss the compositional nature of this, formulated as a functor on the underlying category, and explore how this can be used for a more type-driven approach to statistical inference.
Verif.ai: Towards an Open-Source Scientific Generative Question-Answering System with Referenced and Verifiable Answers
In this paper, we present the current progress of the project Verif.ai, an open-source scientific generative question-answering system with referenced and verified answers. The components of the system are (1) an information retrieval system combining semantic and lexical search techniques over scientific papers (PubMed), (2) a fine-tuned generative model (Mistral 7B) taking top answers and generating answers with references to the papers from which the claim was derived, and (3) a verification engine that cross-checks the generated claim and the abstract or paper from which the claim was derived, verifying whether there may have been any hallucinations in generating the claim. We are reinforcing the generative model by providing the abstract in context, but in addition, an independent set of methods and models are verifying the answer and checking for hallucinations. Therefore, we believe that by using our method, we can make scientists more productive, while building trust in the use of generative language models in scientific environments, where hallucinations and misinformation cannot be tolerated.
Neural network emulator to constrain the high-z IGM thermal state from Lyman-α forest flux auto-correlation function
We present a neural network emulator to constrain the thermal parameters of the intergalactic medium (IGM) at $5.4 \leq z \leq 6.0$ using the Lyman-$\alpha$ (Ly$\alpha$) forest flux auto-correlation function. Our auto-differentiable JAX-based framework accelerates the surrogate model generation process using approximately 100 sparsely sampled Nyx hydrodynamical simulations with varying combinations of thermal parameters, i.e., the temperature at mean density $T_0$, the slope of the temperature-density relation $\gamma$, and the mean transmission flux $\langle F \rangle$. We show that this emulator has a typical accuracy of 1.0% across the specified redshift range. Bayesian inference of the IGM thermal parameters, incorporating emulator uncertainty propagation, is further expedited using NumPyro Hamiltonian Monte Carlo. We compare both the inference results and computational cost of our framework with the traditional nearest-neighbor interpolation approach applied to the same set of mock Ly$\alpha$ flux. By examining the credibility contours of the marginalized posteriors for $T_0$, $\gamma$, and $\langle F \rangle$ obtained using the emulator, the statistical reliability of measurements is established through inference on 100 realistic mock data sets of the auto-correlation function.
Shaping Laser Pulses with Reinforcement Learning
High Power Laser (HPL) systems operate in the attosecond regime, the shortest timescale ever created by humanity. HPL systems are instrumental in high-energy physics, leveraging ultra-short impulse durations to yield extremely high intensities, which are essential for both practical applications and theoretical advancements in light-matter interactions. Traditionally, the parameters regulating HPL optical performance have been manually tuned by human experts, or optimized using black-box methods that can be computationally demanding. Critically, black-box methods rely on stationarity assumptions that overlook complex dynamics in high-energy physics and day-to-day changes in real-world experimental settings, and thus often need to be restarted. Deep Reinforcement Learning (DRL) offers a promising alternative by enabling sequential decision making in non-static settings. This work explores the feasibility of applying DRL to HPL systems, extending the current research by (1) learning a control policy relying solely on non-destructive image observations obtained from readily available diagnostic devices, and (2) retaining performance when the underlying dynamics vary. We evaluate our method across various test dynamics, and observe that DRL effectively enables cross-domain adaptability, coping with fluctuations in the dynamics while achieving 90% of the target intensity in test environments.
Machine learning-driven Anomaly Detection and Forecasting for Euclid Space Telescope Operations
State-of-the-art space science missions increasingly rely on automation due to spacecraft complexity and the costs of human oversight. The high volume of data, including scientific and telemetry data, makes manual inspection challenging. Machine learning offers significant potential to meet these demands. The Euclid space telescope, in its survey phase since February 2024, exemplifies this shift. Euclid's success depends on accurate monitoring and interpretation of housekeeping telemetry and science-derived data. Thousands of telemetry parameters, monitored as time series, may or may not impact the quality of scientific data. These parameters have complex interdependencies, often due to physical relationships (e.g., proximity of temperature sensors). Optimising science operations requires careful anomaly detection and identification of hidden parameter states. Moreover, understanding the interactions between known anomalies and physical quantities is crucial yet complex, as related parameters may display anomalies with varied timing and intensity. We address these challenges by analysing temperature anomalies in Euclid's telemetry from February to August 2024, focusing on eleven temperature parameters and 35 covariates. We use a predictive XGBoost model to forecast temperatures based on historical values, detecting anomalies as deviations from predictions. A second XGBoost model predicts anomalies from covariates, capturing their relationships to temperature anomalies. We identify the top three anomalies per parameter and analyse their interactions with covariates using SHAP (Shapley Additive Explanations), enabling rapid, automated analysis of complex parameter relationships. Our method demonstrates how machine learning can enhance telemetry monitoring, offering scalable solutions for other missions with similar data challenges.
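The first step described above, forecasting a temperature from its own history and flagging large deviations from the forecast as anomalies, can be sketched as follows. This is a minimal illustration under stated assumptions: it swaps the paper's XGBoost forecaster for a plain least-squares autoregressive model in NumPy, uses synthetic data rather than Euclid telemetry, and the lag count, the MAD-based threshold, and all function names are hypothetical choices.

```python
import numpy as np

def lagged_matrix(series, n_lags):
    """Build rows [x[t-n_lags], ..., x[t-1]] with targets x[t]."""
    n = len(series)
    X = np.column_stack([series[i:n - n_lags + i] for i in range(n_lags)])
    y = series[n_lags:]
    return X, y

def detect_anomalies(series, n_lags=5, threshold=4.0):
    """Flag time indices whose forecast residual is unusually large."""
    X, y = lagged_matrix(series, n_lags)
    A = np.column_stack([X, np.ones(len(X))])        # add an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)     # linear AR fit (stand-in for XGBoost)
    residuals = y - A @ coef
    # Robust residual scale via the median absolute deviation (MAD)
    med = np.median(residuals)
    mad = np.median(np.abs(residuals - med))
    scale = 1.4826 * mad if mad > 0 else residuals.std()
    # Shift by n_lags so the indices refer to positions in the original series
    flagged = np.where(np.abs(residuals - med) > threshold * scale)[0] + n_lags
    return flagged, residuals

# Synthetic "temperature" series: slow oscillation plus sensor noise (not real telemetry)
rng = np.random.default_rng(0)
t = np.arange(500)
temps = 20.0 + 0.5 * np.sin(2 * np.pi * t / 50) + rng.normal(0.0, 0.02, size=t.size)
temps[300] += 1.0  # inject one artificial temperature spike

flagged, residuals = detect_anomalies(temps)
print(flagged)
```

Samples immediately after the injected spike may also be flagged, because the spike then appears among the lagged inputs of the forecaster. The abstract's second stage, a model that predicts anomalies from covariates and is explained with SHAP, is not covered by this sketch.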
Shadow Cones: A Generalized Framework for Partial Order Embeddings
Hyperbolic space has proven to be well-suited for capturing hierarchical relations in data, such as trees and directed acyclic graphs. Prior work introduced the concept of entailment cones, which uses partial orders defined by nested cones in the Poincaré ball to model hierarchies. Here, we introduce the "shadow cones" framework, a physics-inspired entailment cone construction. Specifically, we model partial orders as subset relations between shadows formed by a light source and opaque objects in hyperbolic space. The shadow cones framework generalizes entailment cones to a broad class of formulations and hyperbolic space models beyond the Poincaré ball. This results in clear advantages over existing constructions: for example, shadow cones possess better optimization properties over constructions limited to the Poincaré ball. Our experiments on datasets of various sizes and hierarchical structures show that shadow cones consistently and significantly outperform existing entailment cone constructions. These results indicate that shadow cones are an effective way to model partial orders in hyperbolic space, offering physically intuitive and novel insights about the nature of such structures.
Optical Spectroscopy of Classical Be Stars in Old Open Clusters
We performed optical spectroscopy of 16 classical Be stars in 11 open clusters older than 100 Myr. Ours is the first spectroscopic study of classical Be stars in open clusters older than 100 Myr. We found that the Hα emission strength of most of the stars is less than 40 Å, in agreement with previous studies. Our analysis further suggests that one of the stars, KW97 35 12, might be a weak Hα emitter in nature, showing an Hα equivalent width of -0.5 Å. Interestingly, we also found that the newly detected classical Be star LS III 47 37b might be a component of the possible visual binary system LS III 47 37, where the other companion is also a classical Be star. Hence, the present study indicates the possible detection of a binary Be system. Moreover, all 16 stars exhibit fewer emission lines than classical Be stars younger than 100 Myr. Furthermore, the spectral type distribution analysis of B-type and classical Be (CBe) stars for the selected clusters indicates that the existence of CBe stars can depend on the spectral type distribution of the B-type stars present in these clusters.