ML4SPO: Machine Learning for Search Space Optimization

This internship aims to optimize and understand the interactions among applications, the compiler, the runtime, the operating system, and the hardware using machine learning.

Context

Compilation, runtime, and hardware parameters affect both performance and energy. Such parameters are controlled by developers and include compilation passes, NUMA thread/process placement (which conditions communication costs) [3], prefetching, core/uncore frequency, cache occupancy, and memory bandwidth. The operating system, the runtime, and the compiler provide heuristics to guide parameter selection but, due to the complexity and diversity of systems and applications, many optimization opportunities are missed. Efficiently executing an application therefore requires exploring a large search space of parameters.

To focus the scope of this study, we will consider a defined set of applications on HPC systems. Each parameter is tuned through specific knobs. For example, compiler passes can be invoked directly in the compiler middle-end, thread placement is set through environment variables, and prefetchers can be enabled or disabled by writing values into specific registers. The student will work in an environment that makes it possible to change these parameters and evaluate them over a set of established applications.
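
For illustration, here is a minimal Python sketch of driving two such knobs. It assumes an Intel CPU whose hardware prefetchers are controlled by MSR 0x1A4 (true of many recent Xeons), a loaded msr kernel module, and root privileges; the bit layout and the chosen values are illustrative.

    import os, struct

    def write_msr(cpu, reg, value):
        # Write `value` into MSR `reg` on core `cpu` (needs root and `modprobe msr`).
        fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
        try:
            os.pwrite(fd, struct.pack("<Q", value), reg)
        finally:
            os.close(fd)

    # On many Intel CPUs, MSR 0x1A4 exposes four prefetcher-disable bits
    # (L2 HW, L2 adjacent line, DCU streamer, DCU IP); a set bit disables one.
    for cpu in range(os.cpu_count()):
        write_msr(cpu, 0x1A4, 0b1111)  # disable all four prefetchers on every core

    # Thread placement: OpenMP runtimes read these variables at program launch.
    os.environ["OMP_NUM_THREADS"] = "16"
    os.environ["OMP_PLACES"] = "cores"
    os.environ["OMP_PROC_BIND"] = "close"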

Research goals

  • Identify whether an application is sensitive to individual parameters and to interactions between combinations of parameters. In the short term, we will measure the impact of each parameter. In the long run, we are looking for application characteristics from which this information could be inferred (e.g., code properties, performance counters). This is valuable to estimate whether more costly optimization strategies are worthwhile. We will also investigate how the different parameters interact with each other.

  • Discover efficient optimization parameters for a given application. The main challenge is the size of the search space. We can rely on different strategies to explore it: random search, sampling, genetic algorithms, or even reinforcement learning (a minimal random-search sketch follows this list).

  • Identify diverging strategies between optimizing an application for maximum performance and for minimum energy consumption. Our intuition is that the more aggressively parameters are optimized, the more likely gaps appear between the most performance-efficient and the most energy-efficient configurations. Such gaps are valuable insights for the community as they identify both where to invest programming effort and promising hardware design trade-offs (e.g., more cache at a lower frequency versus less cache operating faster).

  • Design models to make optimization decisions. Exploring a large parameter search space for every new application we want to optimize is very resource-consuming. We are therefore interested in building supervised models that can predict performance/energy improvement opportunities and the means to achieve them. A key aspect is the study of the information (i.e., the features) that the models will use. Candidate features include static code embeddings, performance counters, and execution traces (see the model sketch below).
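
To make the search problem concrete, here is a minimal random-search sketch over a toy knob space. The knob names, their values, and the run_and_measure helper are hypothetical placeholders for the actual framework.

    import random

    # Hypothetical knob space; the real space mixes compiler, runtime, and hardware knobs.
    SPACE = {
        "threads":    [1, 2, 4, 8, 16, 32],
        "placement":  ["close", "spread"],
        "prefetch":   [0b0000, 0b1111],      # all prefetchers on / all off
        "uncore_ghz": [1.2, 1.8, 2.4],
    }

    def run_and_measure(config):
        # Placeholder: apply `config`, run the application, return its runtime in seconds.
        raise NotImplementedError

    def random_search(budget=100):
        best, best_time = None, float("inf")
        for _ in range(budget):
            config = {knob: random.choice(values) for knob, values in SPACE.items()}
            elapsed = run_and_measure(config)
            if elapsed < best_time:
                best, best_time = config, elapsed
        return best, best_time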

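For the supervised models of the last goal, here is a minimal sketch of the learning setting. The scikit-learn calls are real; the data is a synthetic stand-in for per-application feature vectors (e.g., performance counters or code embeddings) labeled with the best-known configuration.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Stand-in data: in the internship, X would hold per-application features and
    # y the best-known configuration class for each application.
    X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                               n_classes=4, random_state=0)

    model = RandomForestClassifier(n_estimators=200, random_state=0)
    scores = cross_val_score(model, X, y, cv=5)  # leave-one-app-out would be stricter
    print(f"mean cross-validated accuracy: {scores.mean():.2f}")
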
Study material

We consider the following optimization parameters, all of which we can control:

  • thread placement
  • data placement
  • parallelism
  • NUMA effects
  • compiler passes
  • prefetch
  • core/uncore frequency
  • Intel CAT: bandwidth and cache control

To evaluate these parameters and train models, we consider a set of benchmark suites that we have already gathered and set up. They include, but are not limited to, the OpenMP NAS Parallel Benchmarks (NPB), Rodinia, and PARSEC.

We also have access to previously collected traces reporting performance [3] and energy [1] measurements across NUMA thread and data placements, degrees of parallelism, and prefetch configurations. We will analyze this dataset to better understand the interactions across parameters.

Getting started

We will start by studying the impact of the prefetcher by investigating our dataset [1,3]. We will look at prefetch configurations that share the same performance: by considering the energy variation among them, we can better understand the behavior of the prefetcher and, in particular, detect situations where the prefetchers bring no performance benefit but cause an energy consumption overhead.
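
A minimal pandas sketch of this analysis, assuming a hypothetical CSV export of the traces with one row per (application, prefetch configuration) run; the column names and thresholds are illustrative.

    import pandas as pd

    # Hypothetical layout of the traces of [1,3]: one row per run.
    df = pd.read_csv("prefetch_runs.csv")  # columns: app, prefetch_cfg, time_s, energy_j

    # For each application, keep configurations whose runtime is within 1% of the
    # best one, then check how much energy still varies among them.
    for app, runs in df.groupby("app"):
        best = runs["time_s"].min()
        tied = runs[runs["time_s"] <= 1.01 * best]
        spread = tied["energy_j"].max() / tied["energy_j"].min()
        if spread > 1.05:  # same performance, >5% energy gap: prefetching wastes energy
            print(f"{app}: {len(tied)} near-best configs, energy spread x{spread:.2f}")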

We will also use the dataset to extract a small group of configurations, which we plan to re-evaluate while varying other optimization spaces: in particular, Intel CAT settings [4], the choice of processor frequency, and the use of SIMD instructions. This should enable us to better understand the interactions between the different parameters.
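
As a sketch of how such knobs could be driven from Python, the snippet below assumes the intel-cmt-cat pqos tool [4] and the cpupower utility are installed and run as root; the cache masks, core ranges, and frequency are illustrative.

    import subprocess

    def set_llc_ways(cores, mask):
        # Restrict `cores` to the LLC ways in `mask` via Intel CAT (needs root + pqos).
        subprocess.run(["pqos", "-e", f"llc:1={mask:#x}"], check=True)  # define COS 1
        subprocess.run(["pqos", "-a", f"llc:1={cores}"], check=True)    # assign cores

    def cap_core_frequency(ghz):
        # Cap the maximum core frequency through the cpupower utility.
        subprocess.run(["cpupower", "frequency-set", "-u", f"{ghz}GHz"], check=True)

    set_llc_ways("0-15", 0x00F)  # example: 4 LLC ways for cores 0-15
    cap_core_frequency(2.0)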

Study plan and goals

The student will work in close collaboration with Lana Scravaglieri, a Ph.D. student working jointly at IFPEN and Inria Bordeaux on search space exploration. The goal for the student is to develop their knowledge of runtime optimizations, data analysis, and ML, as well as their writing and presentation skills. The internship is also an opportunity to observe how academic research is conducted. Depending on the research results, the student can also participate in the writing of a research article.

The following tasks will be carried out within a framework developed by Lana Scravaglieri.

  1. Investigate the dataset to sample behaviors and understand parameter interactions.
  2. Measure the performance and energy of applications across a set of known parameters. This task both estimates their sensitivity to optimizations and discovers efficient parameters (see the measurement sketch after this list).
  3. Analyze codes, using static [2] or dynamic profiling (e.g., performance counters, reaction-based profiling, code embeddings). This task is useful to train models and to understand the system/application interaction.
  4. Train models that predict optimizations from code analysis or classify applications according to their parameter sensitivity.
  5. Explore optimizations more aggressively (repeat tasks 2-3-4).
  6. Derive takeaways for system designers and application developers.
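
For task 2, here is a minimal measurement wrapper, assuming a Linux machine where perf exposes the RAPL package-energy event (power/energy-pkg/); event names and output formats vary across kernels and CPUs, and the benchmark binary name is hypothetical.

    import re, subprocess

    def measure(cmd):
        # Run `cmd` under perf and return (seconds, package energy in joules).
        # RAPL events are system-wide, hence -a; needs root and RAPL support.
        res = subprocess.run(["perf", "stat", "-a", "-e", "power/energy-pkg/", *cmd],
                             capture_output=True, text=True)
        out = res.stderr  # perf stat prints its report on stderr
        energy = float(re.search(r"([\d,.]+)\s+Joules", out).group(1).replace(",", ""))
        secs = float(re.search(r"([\d,.]+)\s+seconds time elapsed", out)
                     .group(1).replace(",", ""))
        return secs, energy

    print(measure(["./cg.B.x"]))  # hypothetical NPB binary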

Keywords:

Machine learning, compiler optimization, runtime thread and data placement

Prerequisites:

  • Motivation.
  • Curiosity and ability to learn new concepts.
  • Some experience with scripting languages (e.g., Python). This is necessary to set up large-scale executions and analyze the data.
  • Basics of Linux. The project requires updating environment variables or hardware registers to apply the optimizations.
  • Knowledge of basic ML is a plus.

Contact:

mihail.popov@inria.fr, lana.scravaglieri@inria.fr

Internship location:

STORM project-team, Inria centre at the University of Bordeaux.

References:

  1. Optimizing performance and energy across problem sizes through a search space exploration and machine learning
  2. Learning Intermediate Representations using Graph Neural Networks for NUMA and Prefetchers Optimization
  3. Modeling and Optimizing NUMA Effects and Prefetching with Machine Learning
  4. Introduction to Cache Allocation Technology in the Intel® Xeon® Processor E5 v4 Family