Researchers have developed a technique that can automatically and dramatically speed up certain types of computer programs while ensuring that their results remain accurate.
The system speeds up programs that run in the Unix shell, a ubiquitous programming environment created 50 years ago that is still widely used today. It parallelizes these programs, breaking their components into pieces that can run simultaneously on multiple computer processors.
This allows programs to perform tasks such as web indexing, natural language processing, or data analysis in a fraction of their original runtime.
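For a concrete sense of what that parallelization looks like, here is a hand-written sketch that splits one pipeline stage across two chunks of its input and runs both halves at once. The file names are illustrative, `split -n` assumes GNU coreutils, and PaSh performs this kind of analysis and orchestration automatically rather than requiring it by hand:

```shell
#!/bin/sh
# Hand-written data parallelism for one stage, shown for illustration.
# `grep -c` processes each line independently, so the input can be
# split into chunks, counted concurrently, and the partial counts
# summed to reproduce the sequential answer.
seq 1 1000 > input.txt
split -n l/2 input.txt chunk_     # GNU split: two line-based chunks
grep -c 7 chunk_aa > out_aa &     # count matches in each half...
grep -c 7 chunk_ab > out_ab &     # ...concurrently
wait                              # join before combining results
awk '{total += $1} END {print total}' out_aa out_ab
```

The combined total matches what a single `grep -c 7 input.txt` would report; the point of a system like PaSh is to discover such safe splits without the programmer writing any of this scaffolding.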
“There are so many people using these types of programs, including data scientists, biologists, engineers, and economists. Now they can automatically speed up their programs without fear of getting incorrect results,” says Nikos Vasilakis, a research scientist at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL).
The system also benefits the programmers who build the tools that data scientists, biologists, engineers, and others rely on. They don’t need to make any special adjustments to their program commands to enable this automatic, error-free parallelization, adds Vasilakis, who leads a committee of researchers from around the world that has been working on the system for almost two years.
Vasilakis is senior author of the group’s latest research paper, co-written with Tammam Mustafa, a student at MIT CSAIL, to be presented at the USENIX Symposium on Operating Systems Design and Implementation. Co-authors include lead author Konstantinos Kallas, a graduate student at the University of Pennsylvania; Jan Bielak, a student at Staszic High School in Warsaw; Dimitris Karnikis, a software engineer at Aarno Labs; Thurston H.Y. Dang, a former MIT postdoc who is now a software engineer at Google; and Michael Greenberg, an assistant professor of computer science at the Stevens Institute of Technology.
A decades-old problem
This new system, known as PaSh, focuses on programs or scripts running in the Unix shell. A script is a sequence of commands that instructs a computer to perform a calculation. Correct and automatic parallelization of shell scripts is a thorny problem that researchers have grappled with for decades.
The Unix shell remains popular because it is the only programming environment that allows a script to be assembled from functions written in multiple programming languages. Different programming languages are better suited to certain tasks or data types; when a developer uses the right language, solving a problem can be much easier.
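A small, illustrative pipeline shows this multilingual glue at work: each stage below is effectively written in a different language — a sed editing expression, an awk program, and a compiled C utility — and the shell stitches them together:

```shell
#!/bin/sh
# Three languages in one pipeline, glued by the shell.
printf 'alice,3\nbob,1\nalice,2\n' |
  sed 's/,/ /' |                    # sed: normalize the delimiter
  awk '{sum[$1] += $2}
       END {for (k in sum) print k, sum[k]}' |   # awk: aggregate per key
  sort                              # sort (a C utility): stable output order
```

This is exactly the kind of heterogeneous script that traditional single-language analysis tools cannot reason about.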
“People also like to develop in different programming languages, so it’s very common to put all these components together in a single program,” adds Vasilakis.
While the Unix shell allows for multilingual scripts, its flexible and dynamic structure makes it difficult to parallelize these scripts using traditional methods.
Parallelizing a program is usually difficult because some parts of it depend on others. Those dependencies determine the order in which the components must run; if the order is wrong, the program will fail.
When a program is written in a single language, developers have explicit information about its features, and tools for that language, that they can use to determine which components can be parallelized. But no such tools exist for Unix shell scripts. Users cannot easily see what is happening inside the components or extract information that would help with parallelization.
A just-in-time solution
To solve this problem, PaSh uses a preprocessing step that inserts simple annotations into program components that it believes are parallelizable. Then PaSh tries to parallelize those parts of the script while the program is running, at the exact moment it reaches each component.
This sidesteps another problem with shell programming: it is impossible to predict a program’s behavior ahead of time. By parallelizing components “just in time,” the system can effectively accelerate many more of them than traditional methods that attempt all parallelization in advance.
Just-in-time parallelization also ensures that the accelerated program still delivers accurate results. If PaSh encounters a program component that cannot be parallelized (perhaps it depends on a component that hasn’t been run yet), it simply runs the original version and avoids an error.
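The shape of that runtime decision can be sketched in a few lines of shell. This is a hypothetical illustration, not PaSh’s actual code — the predicate below is a stub standing in for PaSh’s analysis — but it captures the fallback logic described above:

```shell
#!/bin/sh
# Hypothetical sketch of the just-in-time decision (not PaSh's code).
stage_is_parallelizable() {
  # Stub predicate standing in for PaSh's real analysis: treat a few
  # well-understood commands as safe, everything else as unknown.
  case "$1" in sort|grep) return 0 ;; *) return 1 ;; esac
}
run_stage() {
  if stage_is_parallelizable "$1"; then
    echo "parallel plan for: $1"
  else
    echo "sequential fallback: $1"   # original version runs; no error
  fi
}
run_stage sort                # → parallel plan for: sort
run_stage my_stateful_tool    # → sequential fallback: my_stateful_tool
```

Because the unknown case always falls back to the original command, correctness is never traded away for speed.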
“Regardless of the performance benefits — if you promise to get something up and running in a second instead of a year — if there’s a chance it will return incorrect results, no one will use your method,” says Vasilakis.
Users don’t need to make any changes to their scripts to use PaSh; they can simply add the tool to their existing Unix shell and tell their scripts to use it.
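In practice that adoption model looks like swapping the interpreter and nothing else. In the sketch below, the `pa.sh` entry-point name is an assumption based on the open-source project (check its documentation for the exact interface), and the script itself is left untouched either way:

```shell
#!/bin/sh
# Drop-in adoption sketch: run the same, unmodified script with PaSh
# when it is installed, or with the ordinary shell otherwise.
# NOTE: the `pa.sh` command name is an assumption about the project.
RUNNER=sh
if command -v pa.sh >/dev/null 2>&1; then
  RUNNER=pa.sh
fi
printf 'echo hello\n' > hello.sh   # an existing, unchanged user script
"$RUNNER" hello.sh
```

The user-visible behavior is identical; only the runtime underneath changes.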
Acceleration and accuracy
Researchers tested PaSh on hundreds of scripts, from classic to modern programs, and not a single one broke. On average, the system was able to run programs six times faster compared to non-parallel scripts, and it achieved a maximum acceleration of almost 34 times.
It also increased the speed of scripts that other approaches could not parallelize.
“Our system is the first to show this type of fully correct transformation, but there is an indirect benefit as well. The way our system is designed allows other researchers and users in industry to build on this work,” says Vasilakis.
He looks forward to receiving additional feedback from users and seeing how they improve the system. The open-source project joined the Linux Foundation last year, making it widely available to users in industry and academia.
In the future, Vasilakis would like to use PaSh to address the distribution problem – partitioning a program to run on many computers rather than many processors within one computer. He also wants to improve the annotation scheme so that it is more user-friendly and better able to describe complex program components.
“Unix shell scripts play a key role in data analysis and software development tasks. These scripts could run faster by making the various programs they invoke use the multiple processing units available in modern CPUs. However, the dynamic nature of the shell makes it difficult to create parallel execution plans ahead of time,” says Diomidis Spinellis, a professor of software engineering at the Athens University of Economics and Business and professor of software analytics at Delft University of Technology, who was not involved in this research. “Through just-in-time analysis, PaSh-JIT manages to overcome the dynamic complexity of the shell, and thus reduces script execution times while maintaining the correctness of the corresponding results.”
“A drop-in replacement for an ordinary shell that orchestrates steps but doesn’t reorder or segment them, PaSh provides a hassle-free way to improve the performance of large data processing jobs,” adds Douglas McIlroy, adjunct professor in the Department of Computer Science at Dartmouth College, who previously headed the Computing Techniques Research Department at Bell Laboratories (the birthplace of the Unix operating system). “Hand optimization to exploit concurrency must be done at a level for which ordinary programming languages (including shells) do not provide clean abstractions. The resulting code mixes issues of logic and efficiency. It’s hard to read and difficult to maintain in the face of changing requirements. PaSh skillfully intervenes at this level, preserving the original logic on the surface while achieving efficiency when the program is run.”
This work was supported in part by the Defense Advanced Research Projects Agency and the National Science Foundation.