This internship aims to optimize and understand the interactions across applications, compilers, runtimes, operating systems, and hardware using machine learning.
Compilation, runtime, and hardware parameters affect both performance and energy. Such parameters are controlled by developers and include compilation passes, NUMA thread/process placement (which conditions communication), prefetching, core/uncore frequency, cache occupancy, and memory bandwidth. The operating system, the runtime, and the compiler provide heuristics to guide parameter selection but, unfortunately, due to the complexity and diversity of systems and applications, many optimization opportunities are missed. Efficiently executing an application therefore requires exploring a large parameter search space.
To focus the scope of this study, we will consider a defined set of applications on HPC systems. Each parameter is tuned through specific knobs. For example, compiler passes can be invoked directly in the compiler middle-end, thread placement is set through environment variables, and prefetchers can be enabled or disabled by writing values to specific registers. The student will work in an environment that enables changing these parameters and evaluating them over a set of established applications.
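As a minimal sketch of how such an environment can enumerate knob settings, the snippet below builds the cross-product of two standard OpenMP environment variables (OMP_NUM_THREADS and OMP_PLACES are real OpenMP variables; the chosen values and the idea of driving runs from this list are illustrative assumptions, not the project's actual tooling):

```python
import itertools

# Hypothetical knob values; a real campaign would launch the benchmark
# binary once under each resulting environment.
knobs = {
    "OMP_NUM_THREADS": ["8", "16", "32"],
    "OMP_PLACES": ["cores", "sockets"],
}

def enumerate_configs(knobs):
    """Yield one environment dict per point of the knob cross-product."""
    names = sorted(knobs)
    for values in itertools.product(*(knobs[n] for n in names)):
        yield dict(zip(names, values))

configs = list(enumerate_configs(knobs))
print(len(configs))  # 3 thread counts x 2 placements = 6 settings
```

Each dict can then be merged into the process environment before spawning the benchmark, so every run is fully described by its configuration.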
Identify whether an application is sensitive to individual parameters and to interactions between combinations of parameters. In the short term, we will measure the impact of each parameter. In the long run, we are looking for application characteristics that could leverage this information (e.g., code properties, performance counters). This is valuable for estimating whether more costly optimization strategies are worthwhile. We will also investigate how different parameters interact with each other.
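To make the notion of parameter impact and interaction concrete, here is a toy sensitivity computation on a 2x2 design (prefetcher on/off crossed with two thread placements). The runtimes are invented for illustration; only the averaging logic reflects the intended analysis:

```python
from statistics import mean

# Synthetic runtimes (seconds) for a 2x2 design; values are made up.
runtime = {
    ("prefetch_on", "compact"): 10.0,
    ("prefetch_on", "scatter"): 8.0,
    ("prefetch_off", "compact"): 12.0,
    ("prefetch_off", "scatter"): 11.0,
}

def main_effect(level_a, level_b, axis):
    """Average runtime change when one parameter flips from level_b
    to level_a, averaged over the other parameter's levels."""
    diffs = []
    for key, t in runtime.items():
        if key[axis] == level_a:
            partner = list(key)
            partner[axis] = level_b
            diffs.append(t - runtime[tuple(partner)])
    return mean(diffs)

# Disabling the prefetcher costs 2 s with compact placement but 3 s
# with scatter: the averaged effect is 2.5 s, and the 1 s difference
# between placements is evidence that the two knobs interact.
print(main_effect("prefetch_off", "prefetch_on", 0))  # 2.5
```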
Discover efficient optimization parameters for a given application. The main challenge is the size of the search space. We can rely on different search strategies to explore it, such as random search, sampling, genetic algorithms, or even reinforcement learning.
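Random search is the simplest of these strategies and a useful baseline. The sketch below samples a hypothetical three-knob space; the `objective` function is a placeholder cost model standing in for an actual compile-and-run measurement:

```python
import random

# Hypothetical search space; values are illustrative.
space = {
    "threads": [8, 16, 32, 64],
    "prefetch": [0, 1],
    "uncore_mhz": [1200, 1800, 2400],
}

def objective(cfg):
    # Placeholder cost model; in practice this would run the application
    # and return measured runtime or energy.
    return (cfg["threads"] * 0.1
            + (2.0 if not cfg["prefetch"] else 0.0)
            + abs(cfg["uncore_mhz"] - 1800) / 1000)

def random_search(space, budget, seed=0):
    """Evaluate `budget` random configurations, keep the best."""
    rng = random.Random(seed)
    best_cfg, best_cost = None, float("inf")
    for _ in range(budget):
        cfg = {k: rng.choice(v) for k, v in space.items()}
        cost = objective(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost

best, cost = random_search(space, budget=50)
```

Genetic algorithms or reinforcement learning replace the uniform sampling loop with guided proposals, but keep the same configuration-to-cost interface, which is what makes the strategies interchangeable.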
Identify diverging strategies between optimizing an application for maximum performance and for minimum energy consumption. Our intuition is that the more aggressively parameters are optimized, the more likely gaps appear between the best-performing parameters and the most energy-efficient ones. Such gaps are valuable insights for the community, as they identify both where to invest programming effort and promising hardware design trade-offs (e.g., more cache at a lower frequency versus less cache operating faster).
Design models to make optimization decisions. Exploring a large parameter search space for each new application we want to optimize is very resource-intensive. We are therefore interested in building supervised models that can predict potential performance/energy improvement opportunities and the means to achieve them. A key aspect is the study of the information (i.e., the features) that the model will use. Candidate features include static code embeddings, performance counters, and execution traces.
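As a minimal illustration of such a supervised model, the sketch below uses nearest-neighbour regression to map a performance-counter feature vector to a predicted tuning speedup. The two-dimensional features, counter semantics, and training values are all invented for the example; the point is only the shape of the features-to-opportunity mapping:

```python
from math import dist

# Toy training set: (counter feature vector, best-known tuning speedup).
# Features are, say, (IPC proxy, memory-bandwidth utilization) -- purely
# illustrative numbers.
train = [
    ((0.9, 0.1), 1.05),  # compute-bound: little tuning headroom
    ((0.5, 0.7), 1.40),
    ((0.2, 0.9), 1.80),  # memory-bound: large tuning headroom
]

def predict_speedup(features, k=1):
    """k-nearest-neighbour regression over counter vectors."""
    neighbours = sorted(train, key=lambda ex: dist(ex[0], features))
    return sum(y for _, y in neighbours[:k]) / k

print(predict_speedup((0.25, 0.85)))  # nearest to the memory-bound row
```

A real model would be trained on the collected traces and richer features (code embeddings, full counter sets), but the prediction interface stays the same.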
We consider the following optimization-space parameters that we can control:
To evaluate these parameters and train models, we consider a set of benchmark suites that we have already gathered and set up. They include, but are not limited to, the OpenMP NAS Parallel Benchmarks, Rodinia, and PARSEC.
We also have access to previously collected traces reporting performance and energy measurements across NUMA thread and data placement, degree of parallelism, and prefetch configurations. We can analyze this dataset to better understand the interactions across parameters.
We will start by studying the prefetcher impact by investigating our dataset [1,3]. We will look at prefetch configurations that share the same performance: by considering the energy variation, we can better understand the behavior of the prefetcher and, in particular, detect situations where using the prefetchers brings no performance benefit but causes an energy consumption overhead.
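The analysis just described can be sketched as grouping trace rows by near-equal runtime and ranking them by energy. The row format and all numbers below are invented stand-ins for the actual dataset:

```python
# Toy trace rows: (prefetch_config, runtime_s, energy_J); values invented.
rows = [
    ("all_on", 10.0, 500.0),
    ("l2_only", 10.1, 450.0),
    ("all_off", 10.0, 430.0),
    ("other_config", 14.0, 600.0),
]

def iso_performance_group(rows, tol=0.02):
    """Keep configurations within `tol` relative runtime of the fastest,
    sorted by energy. Energy spread inside the group exposes prefetch
    settings that cost energy without improving performance."""
    best = min(t for _, t, _ in rows)
    group = [r for r in rows if (r[1] - best) / best <= tol]
    return sorted(group, key=lambda r: r[2])

for name, t, e in iso_performance_group(rows):
    print(name, t, e)  # "all_off" first: same runtime, least energy
```

Here the three iso-performance configurations span 430 J to 500 J, so enabling all prefetchers would waste roughly 70 J for no speedup, which is exactly the kind of situation the study aims to detect.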
We will also use the dataset to extract a small group of configurations, which we plan to co-execute across other optimization spaces. In particular, we consider Intel CAT settings, the choice of processor frequency, and the use of SIMD instructions. This should enable us to better understand the interactions between the different parameters.
The student will work in close collaboration with Lana Scravaglieri, a joint Ph.D. student at IFPEN and Inria Bordeaux working on search-space exploration. The goal for the student is to develop their knowledge of runtime optimizations, data analysis, and ML, as well as their writing and presentation skills. The internship is also an opportunity to observe how academic research is conducted. Depending on the research results, the student may also participate in the writing process to publish a research article.
The following tasks will be integrated into a framework developed by Lana Scravaglieri.
Machine learning, Compiler optimization, Runtime thread and data placement
STORM Project-Team, Inria Centre at the University of Bordeaux.