Postdoctoral Appointee - Large Scale Data Management and Storage for HPC/AI
Job posting number: #7083994
Posted: August 26, 2021
Application Deadline: Open Until Filled
Job Description

The Exascale Computing Project (ECP) works closely with large-scale scientific applications that are increasingly driven by scalable deep learning (e.g., CANDLE, the CANcer Distributed Learning Environment) running on the largest supercomputers in the world. In this context, we develop efficient techniques to capture, manipulate, and persist large amounts of data in a consistent and resilient fashion (some of which are illustrated by the VELOC project, a low-overhead checkpointing system).

Currently, we are exploring a new data model centered on the notion of data states: intermediate representations of datasets that are automatically recorded into a lineage when applications tag them with hints, constraints, and persistence semantics. This approach lets applications focus on the meaning and properties of their data rather than on how to access it, reducing complexity while unlocking high performance and scalability for many use cases: finding and reusing previous intermediate results to explore alternatives, inspecting the evolution of datasets, verifying correctness, etc. This is especially important in deep learning, where there is an acute need for advanced tools that explore many alternative DNN models and/or ensembles to improve accuracy, training speed, and the ability to generalize and explain a problem.
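To make the idea concrete, here is a minimal sketch of what a data-state API might look like. All names here (`DataState`, `Lineage`, `record`, the hint and persistence keywords) are hypothetical illustrations of the concept, not the actual VELOC or project API.

```python
# Hypothetical sketch: applications tag intermediate datasets ("data states")
# with hints and persistence semantics; states are recorded into a lineage
# that can later be queried to find and reuse previous results.
# Every name below is illustrative, not a real API.

from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class DataState:
    name: str
    data: Any
    hints: dict = field(default_factory=dict)   # e.g. {"access": "read-mostly"}
    persistence: str = "ephemeral"              # e.g. "ephemeral" | "checkpoint"
    parent: Optional["DataState"] = None        # lineage link to the state it was derived from


class Lineage:
    def __init__(self):
        self._states: list = []

    def record(self, name, data, hints=None, persistence="ephemeral", parent=None):
        """Tag an intermediate dataset and record it into the lineage."""
        state = DataState(name, data, hints or {}, persistence, parent)
        self._states.append(state)
        return state

    def find(self, name):
        """Locate a previously recorded state so it can be reused."""
        return next((s for s in self._states if s.name == name), None)

    def ancestry(self, state):
        """Walk the lineage back to the root, e.g. to inspect dataset evolution."""
        chain = []
        while state is not None:
            chain.append(state.name)
            state = state.parent
        return chain


lineage = Lineage()
raw = lineage.record("raw", [1, 2, 3], persistence="checkpoint")
norm = lineage.record("normalized", [x / 3 for x in [1, 2, 3]], parent=raw)
```

With such a model, `find` lets an application branch off a previously recorded state instead of recomputing it, and `ancestry` exposes how a dataset evolved, which supports the reuse, inspection, and correctness-verification use cases mentioned above.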
In addition to addressing these transformative challenges at the intersection of HPC, big data analytics, and machine learning, you will work closely with many domain experts to identify the requirements and bottlenecks of real-life scientific applications that address the needs of our society over the coming decades. You will be part of a vibrant and diverse research community drawing members from more than 100 countries. Our lab hosts Aurora, one of the first exascale supercomputers in the world, which you will have the opportunity to use for your experiments. You will also have access to a large array of leading-edge experimental testbeds through the Joint Laboratory for System Evaluation (JLSE), featuring the latest technologies from vendors such as Intel, NVIDIA, and AMD.
Education and Experience Requirements
A recent or soon-to-be completed PhD degree
Familiarity with large-scale deep learning techniques: data, model, and pipeline parallelism. Ability to conduct interdisciplinary research at the intersection of HPC and deep learning, and to participate in teamwork and broad collaborative efforts involving other laboratories, universities, supercomputer centers, and industry.
Scientific background in distributed computing, and HPC in particular, including:
Strong code development skills with C/C++ and Python
Familiarity with modern data management and I/O best practices
Familiarity with machine/deep learning
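For candidates less familiar with the parallelism techniques listed above, the core idea behind data parallelism can be sketched without any framework: each worker computes the gradient of the same model on its own shard of the batch, and the per-worker gradients are averaged (an allreduce) before the shared update. The tiny linear model and the sequential two-worker loop below are illustrative assumptions; real systems run the workers concurrently (e.g., via MPI or a DDP-style framework).

```python
# Framework-free sketch of data-parallel training on a toy model y ≈ w*x.
# Workers are simulated sequentially; the averaging step plays the role of
# the allreduce that synchronizes gradients across real distributed workers.

def local_grad(w, xs, ys):
    """Gradient of mean squared error for y ≈ w*x on one data shard."""
    n = len(xs)
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / n

def data_parallel_step(w, batch_x, batch_y, n_workers, lr=0.01):
    # 1. Shard the global batch across workers.
    shards = [(batch_x[i::n_workers], batch_y[i::n_workers]) for i in range(n_workers)]
    # 2. Each worker computes a gradient on its own shard.
    grads = [local_grad(w, xs, ys) for xs, ys in shards]
    # 3. "Allreduce": average the gradients so all workers apply the same update.
    avg = sum(grads) / n_workers
    return w - lr * avg

w = 0.0
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # generated with true w = 2
for _ in range(200):
    w = data_parallel_step(w, xs, ys, n_workers=2)
```

Model parallelism would instead split the model's parameters across workers, and pipeline parallelism would stage successive layers on different workers with micro-batches flowing between them.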
Argonne is an equal opportunity employer, and we value diversity in our workforce. As an equal employment opportunity and affirmative action employer, Argonne National Laboratory is committed to a diverse and inclusive workplace that fosters collaborative scientific discovery and innovation. In support of this commitment, Argonne prohibits discrimination or harassment based on an individual's age, ancestry, citizenship status, color, disability, gender, gender identity, genetic information, marital status, national origin, pregnancy, race, religion, sexual orientation, veteran status or any other characteristic protected by law.