Feature Stories
Paul A. Navrátil
Scientific Workflows: Today and Beyond
A leadership perspective
TACC’s Strategic Technologies area is responsible for identifying the future computational needs of current and emerging scientific workflows, defining TACC strategies to support those workflows, and implementing these strategies alongside partners in the research community and in industry.
The team’s current work involves two emerging trends that will impact advanced computing over the next decade. Trend one: increasing machine complexity combined with decreasing user familiarity with and exposure to this complexity. Trend two: growth in distributed workflows where sensor networks and edge computing are combined with direct numerical simulations and deep learning to create real-time predictive and responsive capabilities.
High-level web-based science interfaces (Jupyter, RStudio, domain-specific portals) and containerization platforms (Docker, Apptainer, CharlieCloud) hide hardware details from users even as HPC systems become increasingly layered and heterogeneous in their processing, memory, and file input/output. Using these systems effectively and efficiently will require smart infrastructure that guides users to appropriate resources and configures the execution environment intelligently.
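As a rough illustration of what such smart infrastructure might do, the Python sketch below routes a hypothetical job request to a resource class and fills in an execution environment. The queue names, thresholds, and environment settings are invented for illustration and do not reflect actual TACC scheduling policy.

    # Hypothetical sketch of "smart infrastructure" that routes a job request to
    # an appropriate resource class and fills in an execution environment.
    # Queue names, thresholds, and settings are illustrative only.
    from dataclasses import dataclass, field

    @dataclass
    class JobRequest:
        needs_gpu: bool
        memory_gb: int
        input_size_gb: int

    @dataclass
    class Placement:
        queue: str
        env: dict = field(default_factory=dict)

    def place_job(req: JobRequest) -> Placement:
        # Route to a GPU, large-memory, or standard queue based on the request.
        if req.needs_gpu:
            placement = Placement(queue="gpu")
        elif req.memory_gb > 192:
            placement = Placement(queue="large-memory")
        else:
            placement = Placement(queue="standard")

        # Configure the environment, e.g. stage large inputs to fast scratch storage.
        if req.input_size_gb > 100:
            placement.env["STAGE_TO"] = "/scratch"  # hypothetical fast storage tier
        return placement

    if __name__ == "__main__":
        print(place_job(JobRequest(needs_gpu=False, memory_gb=256, input_size_gb=500)))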
In TACC’s early years, advanced computing often meant a direct numerical simulation of physical phenomena running across hundreds or even thousands of computers, each a “node” in a large interconnected machine. To achieve maximum performance at scale, the computational methods and data structures used in these simulations are carefully constructed to make the best use of the specific processing and memory resources of the host machine.
Such simulations still dominate the number of node-hours consumed each year at TACC and are responsible for important results in atmospheric science, astrophysics, molecular physics, and more.
The past decade has seen a dramatic rise in the number of users and the types of science done at TACC. The concept of advanced computing has expanded beyond the traditional model where a user logs directly into the machine and uses the command line to submit a job to a batch scheduler that allocates the nodes and runs the simulation.
A majority of TACC users now interact with our systems using web portals and web-based services, which present system capabilities with more guidance and science-specific focus than a command-line prompt. These interfaces often simplify the user experience by hiding machine-specific details, and they may even hide the host machine itself through virtualization technologies.
Users are increasingly dependent on the choices made by the interface designer to provide reasonable default parameters for the underlying machine and to expose necessary machine details in an accessible and intelligible way. We draw on our experience with domain-specific portals like DesignSafe-CI and general portals like the TACC Analysis Portal to improve the experience and productivity of both users and service providers.
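In simplified form, the sketch below shows how a portal back end might merge a user's explicit choices with machine-specific defaults before a job ever reaches the scheduler. The machine names, core counts, and default values are illustrative and are not taken from DesignSafe-CI or the TACC Analysis Portal.

    # Illustrative sketch of a portal filling in machine-specific defaults; the
    # systems, core counts, and runtimes shown are examples, not portal settings.
    MACHINE_DEFAULTS = {
        "frontera": {"nodes": 1, "cores_per_node": 56, "runtime": "02:00:00"},
        "lonestar6": {"nodes": 1, "cores_per_node": 128, "runtime": "02:00:00"},
    }

    def build_job(machine: str, user_params: dict) -> dict:
        """Start from the machine defaults and let explicit user choices override them."""
        job = dict(MACHINE_DEFAULTS[machine])
        job.update(user_params)
        return job

    print(build_job("frontera", {"runtime": "08:00:00"}))
    # {'nodes': 1, 'cores_per_node': 56, 'runtime': '08:00:00'}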
Scientific workflows across disciplines are looking to incorporate diverse data flows from remote sensor networks and other real-time sources, both as input to direct numerical simulations (DNS) and as training data for machine learning (ML) models used for data analysis and as surrogates of the simulations themselves. Some ML models will operate in traditional data centers like TACC for data analysis and model training; others will be deployed on edge computing resources at or near the point of data collection for immediate data processing (as at the Large Hadron Collider) or for real-time data-driven operations (such as traffic control for autonomous vehicles).
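A toy example makes the surrogate idea concrete. In the sketch below, a cheap stand-in function plays the role of an expensive simulation code, and a small neural network learns the mapping from input parameters to outputs so that new cases can be estimated almost instantly; the stand-in function and model choice are purely illustrative.

    # Toy sketch of training an ML surrogate for a numerical simulation: a cheap
    # stand-in function replaces the expensive DNS code, and a small regressor
    # learns the input-to-output mapping. Purely illustrative.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def expensive_simulation(params: np.ndarray) -> float:
        # Stand-in for a direct numerical simulation run.
        return np.sin(params[0]) + 0.5 * params[1] ** 2

    rng = np.random.default_rng(0)
    X = rng.uniform(-2.0, 2.0, size=(500, 2))           # sampled input parameters
    y = np.array([expensive_simulation(x) for x in X])  # "simulation" outputs

    surrogate = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
    surrogate.fit(X, y)

    # The trained surrogate now gives fast approximate answers for new inputs.
    print(surrogate.predict(np.array([[0.3, -1.1]])))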
Both deployment modes require regular updates to the ML models as data conditions evolve. The training data must be captured, stored efficiently, and maintained, both to explain current model behavior and to inform the design of future models. Science domains that currently use only DNS also need to maintain sensor data, both for provenance and to support future ML model use.
We are currently working with academic and industry partners in environmental sensing and smart city automation to develop well-engineered, reusable frameworks for these types of workflows, leveraging proven TACC technologies like Tapis and Abaco.
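To give a flavor of the event-driven pattern these frameworks support, the sketch below shows a minimal message-handling function of the kind one might package as an Abaco actor: it wakes on a message carrying a sensor reading, applies a trivial decision rule standing in for a trained model, and emits a result. Reading the message from a MSG environment variable and the payload fields shown are assumptions made for illustration, not a description of the Abaco interface.

    # Minimal sketch of an event-driven "actor" for a sensor-to-decision workflow.
    # The MSG environment variable and the JSON payload shape are assumptions
    # made for illustration; a real deployment would follow the Abaco documentation.
    import json
    import os

    def handle_message(raw_message: str) -> dict:
        reading = json.loads(raw_message)       # e.g. {"sensor": "gauge-12", "water_level_m": 3.7}
        alert = reading["water_level_m"] > 3.0  # placeholder for a trained ML model
        return {"sensor": reading["sensor"], "alert": alert}

    if __name__ == "__main__":
        message = os.environ.get("MSG", '{"sensor": "gauge-12", "water_level_m": 3.7}')
        print(json.dumps(handle_message(message)))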
These trends present exciting opportunities to reconsider traditional data center use models and expand the range of capabilities TACC offers.