Teams of faculty have worked to identify the following research focuses and science drivers for CENSAI. These address major gaps in the current state-of-the-art in the AI research landscape and are intended to be representative, but not exhaustive, of Penn State capacity in foundational AI research.
Research Themes
Literature-based Discovery
Exponential growth in the scientific literature, coupled with increasing specialization in many areas of science, is making it increasingly difficult for scientists to stay current in their own area, let alone stay on top of findings in related areas. The result is the creation of islands of knowledge that impede scientific progress. Literature-Based Discovery (LBD) seeks to automate the discovery of new knowledge from existing literature by extracting scientific claims, along with relevant evidence, from otherwise disparate literatures, thereby facilitating discovery. The resulting claims and counterclaims, when encoded using modern scientific knowledge representation formalisms (e.g., argumentation systems), can accelerate discovery across multiple scientific domains.
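As one illustration of how hidden links can be surfaced from disconnected literatures, the sketch below implements Swanson's classic ABC co-occurrence heuristic, a foundational LBD technique rather than a method specific to CENSAI; the term sets and the Raynaud's/fish-oil example are toy stand-ins for terms that would in practice be extracted from abstracts.

```python
from collections import defaultdict

def abc_candidates(abstracts):
    """Swanson-style ABC discovery: if term A co-occurs with B, and B with C,
    but A never co-occurs with C, then A-C is a candidate hidden link."""
    cooccur = defaultdict(set)
    for terms in abstracts:                      # each abstract as a set of extracted terms
        for t in terms:
            cooccur[t] |= (terms - {t})
    candidates = set()
    for a in cooccur:
        for b in cooccur[a]:                     # A-B link
            for c in cooccur[b]:                 # B-C link
                if c != a and c not in cooccur[a]:
                    candidates.add((a, b, c))    # A and C never directly co-mentioned
    return candidates

# Toy corpus echoing Swanson's classic Raynaud's / fish-oil discovery.
docs = [{"raynaud", "blood_viscosity"},
        {"fish_oil", "blood_viscosity"}]
print(abc_candidates(docs))
```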
Scientific Knowledge Representation
Modern knowledge representation formalisms offer powerful means to describe (i) the relationships between, and the processes that operate on, the entities of interest in specific scientific domains, and (ii) scientific artifacts, measurements, data, hypotheses, experiments, evidence for or against hypotheses, etc., and hence the scientific process itself. Such computational abstractions allow us to view a scientific domain through the computational lens. We understand a phenomenon when we have an algorithmic, mathematical, or empirical model that describes it at the desired level of abstraction. Effective use of the computational lens to advance science necessarily requires mathematical or algorithmic abstractions of the relevant entities, relations, and processes that characterize the scientific domain(s) of interest. Once such abstractions are created, they become computational artifacts that can themselves be analyzed, shared, and integrated with other related scientific artifacts, often leading to the rapid acceleration of science. Of particular interest here are techniques for representing, managing, reasoning about, and validating scientific claims and counterclaims and their provenance, e.g., using argumentation systems.
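A minimal sketch of what such a representation might look like, assuming a simple abstract-argumentation reading in which a claim is accepted if every attacking counterclaim is itself defeated; the Claim class, the acyclic attack graph, and the DOI strings are all hypothetical illustrations, not a fixed CENSAI schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Claim:
    id: str
    statement: str
    provenance: str                                     # e.g., a DOI; values below are made up
    attackers: List[str] = field(default_factory=list)  # ids of counterclaims attacking this one

def accepted(claims):
    """Grounded-style acceptance over an acyclic attack graph:
    a claim stands iff every attacking claim is itself defeated."""
    by_id = {c.id: c for c in claims}
    memo = {}
    def stands(cid):
        if cid not in memo:
            memo[cid] = all(not stands(a) for a in by_id[cid].attackers)
        return memo[cid]
    return [c for c in claims if stands(c.id)]

c1 = Claim("c1", "Protein X folds co-translationally", "doi:10.0000/example1")
c2 = Claim("c2", "Folding of X is post-translational", "doi:10.0000/example2", attackers=["c1"])
print([c.id for c in accepted([c1, c2])])               # ['c1'] -- c2 is defeated by c1
```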
Model-based Machine Learning
Machine learning currently offers one of the most cost-effective approaches to constructing predictive models from data across many disciplines, e.g., the life sciences, health sciences, social sciences, and materials sciences. However, predictive models produced by purely data-driven machine learning often violate physical or other forms of causal constraints, which in turn leads to grossly inaccurate predictions. Addressing this challenge requires encoding domain knowledge, e.g., physical principles, symmetries, constraints, measurement uncertainties, simulation results, etc., in a form that can be readily exploited by machine learning algorithms. However, this task is made challenging by the large gap between the representations used by many modern machine learning formalisms, e.g., deep neural networks, and the natural representations of knowledge used in specific scientific domains, e.g., partial differential equations or directed acyclic graphs that express the relationships between variables of interest. Bridging this language gap, so that domain knowledge can guide data-driven machine learning in ways that reduce data requirements and improve the accuracy, reliability, and interpretability of machine-learned predictive models, presents opportunities for major methodological advances in AI.
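One widely used way to bridge this gap is to add a physics residual to the training loss, as in physics-informed neural networks. The sketch below, assuming PyTorch, fits a small network to noisy observations of a decaying quantity while penalizing violations of an illustrative ODE du/dt = -k·u; the equation, constants, and data are all synthetic stand-ins for real domain knowledge.

```python
import torch

# Hypothetical setup: learn u(t) governed by du/dt = -k * u from a few noisy observations.
k = 1.5
net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))

t_obs = torch.tensor([[0.0], [0.5], [1.0]])
u_obs = torch.exp(-k * t_obs) + 0.01 * torch.randn_like(t_obs)            # synthetic data

t_col = torch.linspace(0.0, 2.0, 50).reshape(-1, 1).requires_grad_(True)  # collocation points

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2000):
    opt.zero_grad()
    data_loss = ((net(t_obs) - u_obs) ** 2).mean()                        # fit the observations
    u = net(t_col)
    du_dt = torch.autograd.grad(u.sum(), t_col, create_graph=True)[0]
    physics_loss = ((du_dt + k * u) ** 2).mean()                          # residual of du/dt = -k u
    (data_loss + physics_loss).backward()
    opt.step()
```

Because the physics term constrains the model everywhere on the collocation grid, three noisy observations suffice where a purely data-driven fit would need far more.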
Explaining Machine Learned Models
Predictive models constructed by modern machine learning methods, e.g., deep learning, are often too complex for scientists to comprehend, and hence to use to gain deep insights into the underlying phenomena or to recommend focused experiments. There is therefore an urgent need for methods for explaining predictive models as well as their predictions. However, almost all existing approaches to explaining predictive models rely primarily on how a model’s inputs correlate with its outputs, and hence fail to generate reliable explanations of what the model has learned. Consequently, there is much interest in explanations that are contrastive (Why did the model predict this and not that?), selective and context-sensitive (providing just the right amount of detail in the right context), and causal (Why did the model produce the output that it did? That is, what is the causal link between a model’s input and its output?).
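A contrastive explanation can be made concrete as a counterfactual: the smallest change to an input that flips the model’s prediction. The sketch below searches for such a counterfactual against a toy logistic model; the TinyLogistic class, its weights, and the gradient-based search are illustrative assumptions, not a specific CENSAI method.

```python
import numpy as np

class TinyLogistic:
    """Toy two-class model standing in for an opaque machine-learned predictor."""
    def __init__(self, w, b):
        self.w, self.b = np.asarray(w), b
    def score(self, x):                          # probability of class 1
        return 1.0 / (1.0 + np.exp(-(self.w @ x + self.b)))
    def predict(self, x):
        return int(self.score(x) > 0.5)

def counterfactual(model, x, lr=0.05, steps=500):
    """Contrastive explanation: a nearby input for which the model predicts
    the *other* class, answering "why this and not that?"."""
    want = 1 - model.predict(x)
    x_cf = np.array(x, dtype=float)
    for _ in range(steps):
        if model.predict(x_cf) == want:
            return x_cf
        grad = (model.score(x_cf) - want) * model.w   # gradient of logistic loss toward `want`
        x_cf -= lr * grad
    return None

m = TinyLogistic(w=[2.0, -1.0], b=0.0)
x0 = np.array([0.2, 1.0])                        # predicted class 0
print(m.predict(x0), counterfactual(m, x0))      # a flipped input close to x0
```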
Optimizing Experiments
Many scientific applications can essentially be cast as uncovering the input-output relationships of a black box from data about its input-output behavior. Purely data-driven approaches require large amounts of training data, but assembling such data presents a significant hurdle in practice because of the cost of measurement and characterization. Hence, there is much interest in methods that minimize the number of candidates, e.g., new materials, that need to be experimentally tested. This presents enormous opportunities for advances in AI, e.g., in active learning, to optimize the design of experiments using all available information, e.g., the cost of experiments, and multiple quantitative or qualitative objectives and the trade-offs among them.
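A minimal sketch of this idea, assuming scikit-learn: a Gaussian process surrogate is fit to the measurements gathered so far, and the next "experiment" is chosen where the surrogate is most uncertain. The black-box function, candidate pool, and budget here are synthetic placeholders.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Hypothetical black box: an expensive "experiment" we want to probe efficiently.
def experiment(x):
    return np.sin(3 * x) + 0.1 * np.random.randn(*x.shape)

pool = np.linspace(0, 2, 200).reshape(-1, 1)         # candidate experiments
X = pool[[0, -1]]                                    # two initial measurements
y = experiment(X).ravel()

gp = GaussianProcessRegressor()
for _ in range(10):                                  # budget of 10 further experiments
    gp.fit(X, y)
    _, std = gp.predict(pool, return_std=True)
    x_next = pool[[np.argmax(std)]]                  # query where the model is least certain
    X = np.vstack([X, x_next])
    y = np.append(y, experiment(x_next).ravel())
```

Richer acquisition functions would also fold in experiment cost and multiple objectives, as described above; uncertainty sampling is simply the easiest case to show.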
Integrative Analyses of Disparate Observational and Experimental Data Sets
Advances in many areas of science increasingly rely on our ability to integrate data from disparate observational and experimental studies using both knowledge-based and data-driven approaches to integrative analyses of data. Of particular interest here are data integration methods that leverage modern representation learning techniques, knowledge graphs, statistical models, and causal assumptions. Moreover, data often reside in separate centralized repositories, or may even be highly distributed across individual personal devices. Performing distributed analyses or federated learning that goes beyond simple summary statistics, and that supports interpretable and generalizable statistical and machine learning models, poses theoretical, methodological, computational, and infrastructural challenges for AI.
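To make the federated setting concrete, the sketch below implements federated averaging (FedAvg) for a toy linear-regression task: each simulated client trains locally on data that never leave it, and a server averages the returned models weighted by client data size. The three clients, learning rates, and data are synthetic assumptions.

```python
import numpy as np

def local_update(w, X, y, lr=0.1, epochs=5):
    """One client's local SGD steps; raw data never leave the client."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def fed_avg(clients, w, rounds=20):
    """Federated averaging: each round, clients train locally and the server
    averages the returned models, weighted by client dataset size."""
    sizes = np.array([len(y) for _, y in clients])
    for _ in range(rounds):
        local = [local_update(w.copy(), X, y) for X, y in clients]
        w = np.average(local, axis=0, weights=sizes)
    return w

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):                                   # three simulated data silos
    X = rng.normal(size=(40, 2))
    clients.append((X, X @ true_w + 0.05 * rng.normal(size=40)))
print(fed_avg(clients, np.zeros(2)))                 # approaches [2, -1]
```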
Human-AI Collaboration
Because accelerating science increasingly requires synergistic interplay between humans and AI, there is a need for machine- as well as human-understandable descriptions of scientific artifacts, organizational and social structures, and the scientific, social, and socio-technical processes that facilitate collaborative science. These include mechanisms for sharing data, experimental protocols, analysis tools, data and knowledge representations, abstractions, visualizations, mental models, and scientific workflows, as well as mechanisms for decomposing and assigning tasks, integrating results, incentivizing participants, and engaging human and AI participants with varying levels of expertise and ability in the scientific process.
Federated Data and Computational Infrastructure for Collaborative, Data-Intensive Science
Scientific progress in many disciplines is increasingly enabled by our ability to examine natural phenomena through the computational lens (e.g., using algorithmic abstractions of the underlying processes) and by our ability to acquire, share, integrate, and analyze disparate types of data. However, realizing the full potential of data to accelerate scientific discovery calls for significant advances in data and computational infrastructure to support collaborative, data-intensive science by teams of researchers that transcend institutional and disciplinary boundaries. Therefore, there is an urgent need for a federated infrastructure that integrates state-of-the-art AI and parallel computing tools, discoverable and shareable scientific workflows, data-intensive computing platforms, storage, and networking.
Motivated by these observations, we envision a robust infrastructure with four complementary layers:
- A physical layer that provides not only CPUs but also heterogeneous compute (e.g., GPUs, FPGAs, TPUs, and custom ASICs) and storage (e.g., various storage-class memories, including Intel Optane and different types of 3D SSDs) components. This physical infrastructure should leverage national, international, and regional data repositories, as well as existing investments in advanced cyberinfrastructure, such as the NSF-funded Big Data Regional Hubs, XSEDE, OSG, and the Eastern Regional Network, among others.
- Domain-independent software support, providing implementations of various machine learning and deep learning algorithms as well as parallel computing platforms such as Spark and Flink.
- Domain-specific software support that facilitates collaborative development and integration of algorithmic abstractions of scientific domains, e.g., biology, coupled with methods and tools for data analytics, modeling, and simulation; cognitive tools (representations, processes, protocols, workflows, software); and reproducible, shareable, and reconfigurable data-intensive scientific workflows to advance collaborative science. This layer also includes access to domain-specific libraries and data repositories.
- Cross-layer software support, including state-of-the-art visualization support augmented with VR/AR, as well as data migration/transfer routines that enable data to move easily from one platform to another. These infrastructure efforts leverage NSF-funded collaborations around the Virtual Data Collaboratory (VDC) and the Eastern Regional Network (ERN).
Science Drivers
Multiscale and Multi-omic Computational Structural Biology
Protein homeostasis concerns the study of how the levels of active protein are regulated in cells. A paradigm shift is occurring in this field: protein structure and function are being recognized to be influenced by the timing of the various molecular processes that acted on them during translation, a shift from the thermodynamic paradigm that assumes memoryless Markovian processes. Under this new paradigm, by understanding the timing of various events, the future state of a newly synthesized protein can be predicted. Coupled with the development of high-throughput experimental data, such as that generated by ribosome profiling, a unique opportunity exists to use artificial intelligence to understand, model, and predict the mRNA and protein sequence features regulating protein levels, protein structure, and protein function.
This area thematically ties together Scientific Knowledge Representation, Model-based Machine Learning, Explaining Machine Learned Models, Optimizing Experiments, and Federated Data and Computational Infrastructure for Collaborative, Data-Intensive Science. Machine learning will aid biophysical modeling by identifying the essential and predictive features of this complex phenomenon, which will form the basis for developing physical models that provide molecular insight. In turn, those physical models can be incorporated into a new round of machine learning models to extract information even more efficiently from experimental data. Finally, the insights gained from these approaches will suggest new hypotheses and reveal unexpected scenarios that can arise in these complex systems, which can then be tested experimentally.
In recognition of the fact that molecules are but one critical component of life, and that all biological molecules function and are understood within the context of organs and organisms, the unique multiscale and multi-omic aspects of this work would extend the study of biological structure to multiple levels of biological context, including cells, tissues, and organisms. It would also encompass cell-oriented biochemical and genetic studies, including single-cell transcriptomics, proteomics, and metabolomics, which have strong, biologically relevant stochastic elements.
Computational Social Science
Emergent collective behavior has remained a consistent focus of researchers across the social and behavioral sciences for over a century. Sociologists, psychologists, and economists have proposed theories to explain group behaviors that cannot be understood as the sum of their constituent parts. Rather, complex interactions among individuals give rise to novel and unexpected system-level organization and norms. Further challenging our understanding of emergence, macro-social properties feed back into individual behaviors, reinforcing effects (immergence). Increased digital connectedness has accelerated and highlighted these phenomena, as we have witnessed striking examples of self-organization and crowd behavior powered by social media, from the emergence of altruistic norms in crises to radicalization and extremism. In parallel, this same digital connectivity has furnished researchers with comprehensive behavioral traces and detailed social ties at massive scale. These data, paired with recent innovations in AI, represent a disruptive opportunity to advance the science of collective behavior. Specifically, advances in “Literature-Based Discovery” and “Scientific Knowledge Representation” will support critical cross-domain exploration and synthesis of the exceptionally broad and diverse literatures relevant to the enigmatic questions of collective behavior, while “Model-based Machine Learning” can serve to meaningfully integrate longstanding theories of collective behavior into data-driven predictions.
Computational Organismal and Tissue Phenomics
All living things are made of cells and function through intricate three-dimensional arrangements of those cells that are understood largely visually and qualitatively, but not quantitatively. The fundamental idea of turning “morphology into math”, in recognition that cells are fundamentally geometric structures, can be applied to four related, but independently fundable, branches of phenomics: 1) organismal phenomics (“Geometry of Life”), the quantitative definition of the diversity of life; 2) genetic phenomics, focusing on the often multi-cell and multi-organ mutant phenotypes (pleiotropy) of systematic series of mutants in small multicellular organisms (invertebrates such as flies and vertebrates such as zebrafish), one gene or noncoding control element at a time; 3) chemical phenomics, which can be divided into pharmacological phenomics and toxicological (including eco-toxicological) phenomics; and 4) disease phenomics (“Geometry of Disease”), whose basis has historically been two-dimensional imaging of tissue slices, or histology. With the emergence of digital pathology, there is an urgent need for infrastructure, methods, and tools to support analyses of cells and tissues; recognition of functional units such as secretory glands, lungs, livers, and tumors; and multi-scale modeling of such structures, as well as their cross-species comparison. AI advances will be critical to image segmentation, reconstruction, clustering, classification, etc. In cancer, these phenomic characterizations will include tumor categorization, computation of tumor heterogeneity, mitotic index, and morphological features important in tumor pathways, and, in coordination with medical outcome data, improved personalization of treatment and prognostics. These efforts will draw on “Scientific Knowledge Representation”, “Model-based Machine Learning”, “Explaining Machine Learned Models”, “Federated Data and Computational Infrastructure for Collaborative, Data-Intensive Science”, and “Human-AI Collaboration”.
Computational Health Sciences
There is increasing recognition that environmental and contextual factors can have a significant impact on health outcomes in diseases such as cancer, obesity, diabetes, and heart disease. The advent of big data offers enormous potential for understanding and predicting health risks, intervention outcomes, and personalized treatments, ultimately improving population health through integrative analysis of heterogeneous, fine-grained, richly structured, longitudinal patient data.
This project aims to bring together an interdisciplinary team of researchers to understand the clinical, behavioral, biomedical, environmental, and contextual (e.g., socio-demographic) factors that contribute to specific health risks or improved health outcomes. The project leverages existing investments in a repository of electronic health records coded using a standardized vocabulary; an NSF-funded collaboration that has led to the establishment of a repository of longitudinal environmental exposures across the United States; and existing collaborations with the Center for Brain Research at IISc, which provides access to data from an ongoing longitudinal study of cognitive decline associated with aging.
Collaborations are under way with clinical researchers focusing on cancer care, public health sciences researchers focusing on the “diseases of despair” impacting rural communities, and others. The key methodological and informatics innovations in the project concern the development of novel algorithms and tools for predictive modeling of health risks and health outcomes by integrating clinical, biomedical, environmental, socio-demographic, and behavioral data, under the themes of “Model-based Machine Learning”, “Explaining Machine Learned Models”, “Optimizing Experiments”, “Integrative Analyses of Disparate Observational and Experimental Data”, “Scientific Knowledge Representation”, and “Federated Data and Computational Infrastructure for Collaborative, Data-Intensive Science”.
Computational Material Science
The demand for new materials for applications ranging from energy technologies (batteries, solar cells, energy-harvesting technologies) to sensors, artificial organs, and computing technologies (e.g., attojoule semiconductor electronics and the emerging area of quantum computers) far exceeds the capabilities of traditional materials discovery, design, characterization, and synthesis, which take years to decades of effort. The Materials Genome Initiative (MGI) has produced substantial advances in theory, modeling, simulation, computing, algorithms, software, data analysis, and experimental techniques, as well as digital infrastructure for sharing data, models, and tools. Rapid advances in machine learning algorithms and software, together with the increasing availability of materials data, offer the possibility of dramatically expediting the discovery, design, and synthesis of new materials.
Collaborations that are under way focus on the prediction and explanation of materials properties, e.g., electronic bandgap, from material composition, thermodynamic stability from density functional theory (DFT) calculations, structural changes, and chemical environments, as well as on identifying novel materials for synthesis and experimental validation, with particular emphasis on materials needed for energy storage and harvesting and for novel memory and computing technologies. These efforts draw on methodological advances in “Literature-Based Discovery”, “Model-based Machine Learning”, “Explaining Machine Learned Models”, “Optimizing Experiments”, “Integrative Analyses of Disparate Observational and Experimental Data”, “Scientific Knowledge Representation”, and “Federated Data and Computational Infrastructure for Collaborative, Data-Intensive Science”.
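As a schematic illustration of composition-to-property prediction (not a CENSAI pipeline), the sketch below fits a random forest to hypothetical composition descriptors and reads off feature importances; the descriptors, data, and target are invented stand-ins for real featurized compositions and DFT-computed bandgaps.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical composition descriptors (e.g., mean electronegativity, mean atomic
# radius, valence-electron count); the data are synthetic, not DFT values.
rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 3))
bandgap = 3.0 * X[:, 0] - 1.5 * X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=500)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[:400], bandgap[:400])
print("held-out R^2:", model.score(X[400:], bandgap[400:]))
print("feature importances:", model.feature_importances_)  # which descriptors drive predictions
```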
Global Economic Stability
Economic collapses such as the Great Depression, the near collapse of the global economic infrastructure in 2007, and the emerging impacts of the COVID-19 pandemic illustrate the real possibility of economic instability, whose causes include technology-driven changes in society, population growth, global climate change, and emerging infectious diseases. The advent of computer technology, including internet-accessible information, has led to new forms of commerce and communication and, at the same time, new forms of financial transactions. These advances can also be leveraged to create constantly updated databases of financial transactions that reflect stability and instability, as well as trends affecting income distribution, global stability, and the alignment of societal resources with the common good.
AI for Enhancing Social Discourse, Policy, and Action
Addressing major societal challenges requires transdisciplinary approaches that integrate basic science, human behavior, and the larger forces that influence the attitudes that shape society. For example, in recognition of the two determinants of phenotype, genes and environment, solving the problems of systemic racism depends upon a fundamental understanding of genetics, human migrations, modern human history, the universality of group-centrism, the study of past and planned solutions to human conflict, and an emphasis on fundamental human values and environmental influences (education and experience). A second example is the application of a systems approach to global climate change, including the systematic collection of shared knowledge, educational materials, practical solutions, legislative and other action, and the characterization and monitoring of changes to climate and their downstream effects on our ecosystems. Such challenges offer unprecedented opportunities for AI technologies to mediate dialog and debate, supported by data and knowledge, and to enhance participatory decision making. Of particular interest are “Scientific Knowledge Representation”, “Human-AI Collaboration”, and “Explaining Machine Learned Models”, among others.