pmtool: Post-Mortem Tool
Purpose: analyse the behavior of StarPU applications post-mortem; provide lower bounds on the makespan; study the performance of different schedulers in a simple context.
Limitations: communications are ignored for the moment. The `comms` branch attempts to remove this limitation.
Compiling: This project uses cmake. Make a new `build/` directory, `cd` to it, and use `cmake ..` to generate Makefiles. Then compile with `make`.
All bounds require CPLEX, whose location can be specified with the `-DCPLEX_ROOT_DIR=<...>` option to `cmake`.
Depends on `librec-dev` (see here); CPLEX depends on `libgpg-error-dev` and `libgcrypt-dev`.
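A typical out-of-tree build might look like the following sketch (the CPLEX path is only a placeholder for your installation):
```
# configure and build in a separate build/ directory
mkdir build && cd build
cmake .. -DCPLEX_ROOT_DIR=/path/to/cplex   # CPLEX location: adjust to your system
make
```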
How to use
Standard usage
`pmtool [INPUT FILES] [OPTIONS]`
where `[OPTIONS]` are:
- `-a <algname>`, `--alg=<algname>`: use algorithm `algname` to compute a schedule. The complete syntax is `algname:option1=value1:option2=value2:...`. A list of possible algorithms can be obtained with `--help`. Several `-a` options may be given to try several algorithms on all input files. It is also possible to specify several values for one option with `option=value1,value2,value3`; then all combinations of option values will be used.
- `-b <boundname>`, `--bound=<boundname>`: use `boundname` to compute a bound. Same option syntax as above.
- `-o option=value`: apply this option to all subsequent algorithms and bounds.
- `-p <platformFile>`, `--platform=<platformFile>`: add `platformFile` to the list of platform files to use. At least one is needed if the instance does not contain execution times for tasks.
- `-v <verb>`, `--verbose=<verb>`: set the verbosity level to `verb`. Also applies to all algorithms and bounds specified after this option. Usually only useful for debugging.
- `-s <savefile>`, `--save=<savefile>`: save all the produced schedules in file `savefile`.
- `-d <type>`, `--default=<type>`: use default algorithms and bounds. This option comes in three different flavors (default is `normal`):
  - `fast`: low-complexity algorithms, to be used on large instances. Uses the `area` bound, and the heft and HeteroPrio algorithms with `rank=min`.
  - `normal`: reasonable-complexity algorithms: `depbound`, and the heft and HeteroPrio algorithms with `rank=min` and `heft`.
  - `slow`: slow algorithms for very precise results: the `depbound` and `interval` bounds, the heft and HeteroPrio algorithms, and greedy using the allocation from the interval bound.
- `--threads`, `--threads=<nb>`: use parallel threads for algorithms (bounds are already parallel thanks to CPLEX). If unspecified, use the number of threads available on the machine.
See more specific options at the end of the document.
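For example, a minimal run with the default set of algorithms and bounds could look like this sketch (input file names follow the Typical usage section below; the save file name is only a placeholder):
```
# default "normal" algorithms/bounds, schedules saved to a file, 4 worker threads
pmtool -p platform_file.rec tasks.rec -d normal -s schedules.txt --threads=4
```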
More details
Input files ending with `.rec` are read assuming the librec format compatible with StarPU; others use an in-house format. Formats are specified here.
The output contains one header line, then one line per input file and per algorithm. The header is:
`input algorithm option1 option2 mkspan time`
Example:
./build/pmtool data/CholeskySirocco-*.pm -b area -a heft:rank=min -a dmdas:rank=heft -a hetprio:verbosity=1
input algorithm rank verbosity mkspan time
data/CholeskySirocco-12.pm area NA NA 125274 63
data/CholeskySirocco-12.pm heft min NA 179923 1
data/CholeskySirocco-12.pm dmdas heft NA 199231 2
data/CholeskySirocco-12.pm HetPrio NA 1 174862 3
data/CholeskySirocco-32.pm area NA NA 2.01504e+06 45
data/CholeskySirocco-32.pm heft min NA 2.27366e+06 40
data/CholeskySirocco-32.pm dmdas heft NA 2.19026e+06 44
data/CholeskySirocco-32.pm HetPrio NA 1 2.09165e+06 20
data/CholeskySirocco-64.pm area NA NA 1.59353e+07 67
data/CholeskySirocco-64.pm heft min NA 1.62474e+07 542
data/CholeskySirocco-64.pm dmdas heft NA 1.64605e+07 358
data/CholeskySirocco-64.pm HetPrio NA 1 1.59718e+07 147
Typical usage
1. Obtain a trace file from a StarPU application:
   `starpu_app`
   `starpu_fxt_tool -i /tmp/prof_file_`
   `starpu_perfmodel_recdump tasks.rec -o platform_file.rec`
2. Run `pmtool` on the trace:
```
pmtool -p platform_file.rec tasks.rec -b area -a heft:rank=min,heft -a hetprio:rank=min,heft:verbosity=5
pmtool -p platform_file.rec tasks.rec -t 0.1 -b interval:pre=4:post=4:paths=yes:limit=10:share=ival:save=filename -a greedy:file=int@ival[alloc]:rank=alloc:stealer=1:export=file.rec
```
Algorithm description and options
Common options to all algorithms:
These options can be used for all algorithms.
- Boolean options have values `yes` or `no`.
- `verbosity`: default 0; set automatically if the algorithm appears after a global `-v` option.
- `save=<name>`: provide a filename to save the schedule of this algorithm. Not to be used with several option values (`option=value1,value2`), since they will overwrite each other.
- `export`: exports the allocation of the algorithm, in librec format.
- `bubble`: additional mandatory option: `bubbleType`. Obscure option to make a specific export: divides the schedule into bubbles depending on the start times of tasks of type `bubbleType`, and outputs for each task the bubble in which it starts, the bubble in which it ends, and its allocation.
- `share=KEY`: creates shares with keys `KEY[alloc]`, `KEY[worker]`, `KEY[end]` and `KEY[start]`, which provide respectively, for each task, its allocation (which resource type it was assigned to), the worker it was assigned to, its start time and its end time, as computed by this algorithm. Shares can be reused by various algorithms, notably the `greedy` scheduler.
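As an illustration, the following sketch saves heft's schedule, exports its allocation in librec format, and shares its results under the key `h` (input and output file names are placeholders):
```
# save the heft schedule, export its allocation, publish its results under key "h"
pmtool -p platform_file.rec tasks.rec -a heft:rank=min:save=heft_schedule.txt:export=heft_alloc.rec:share=h
```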
heft
A real HEFT algorithm. It considers each task in priority order and schedules it as early as possible given all previous decisions, possibly filling holes left by previously scheduled tasks.
Options
- `rank` (default `min`): specifies the priority order. Higher values mean higher priority and thus earlier execution. Any order (except `fifo`) can be prefixed with `#` to reverse it. Possible values:
  - `fifo`: tasks are ordered in the same order as they become ready.
  - `none`: use the order in which tasks are given in the input file.
  - `min`, `heft`, `area`, `unit`: orders based on the task graph: priorities are equal to the bottom levels of tasks according to a specific task weight:
    - `min`: smallest execution time across all resources.
    - `heft`: average execution time across all resources.
    - `area`: average execution time across all resources, weighted by the proportion in the area bound for that task type.
    - `unit`: each task has weight 1.
  - `ext@FILE`: use external priorities from a file (all values in the order in which the tasks are given in the instance file).
  - `int@KEY`: use internal priorities, as computed by another algorithm or bound and shared with the `share` or `export` option.
  - `lex@order1+order2+...+ordern`: combine several orderings in lexicographic order (cannot be `fifo`).
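A sketch comparing several priority orders in a single run (file names as in the earlier examples):
```
# several heft rankings, including a reversed one and a lexicographic combination
pmtool -p platform_file.rec tasks.rec -a heft:rank=min,heft,area -a heft:rank=#unit -a heft:rank=lex@min+unit
```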
list
A standard LIST algorithm (each time a resource is idle, assign the highest priority ready task to it).
Options
- `rank`: same option as in the `heft` algorithm.
hetprio
HeteroPrio, as discussed in Suraj's thesis. A resource-centric algorithm: each time a resource is idle, assign the most suited task to it (highest priority in case of ties for fast resources, lowest priority for small resources). Fast resources are allowed to steal from other resources (and restart) if there are no available tasks. Stealing chooses the highest-priority task that the stealer can finish before the victim.
Options
- `rank`: same as for the `heft` algorithm.
- `combFactor` (default 0): combines several task types together (and considers them a tie) if their "suitability" is within `combFactor`.
- `front` (boolean, default `no`): small resources also pick the highest-priority tasks in case of ties.
- `ssbaf` (boolean, default `no`): "sort steal by acceleration factor". If `yes`, stealers choose the most "suited" task when stealing.
- `stealLatest` (boolean, default `no`): if `yes`, stealers choose the task that finishes last. `ssbaf` has priority.
- `stealidle` (boolean, default `yes`): if `yes`, resources only steal when they are idle. If `no`, and the best suited available task is less suited than a steal candidate, steal it. Should be paired with `ssbaf` to make sense.
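A sketch of a tuned HeteroPrio run (file names as before):
```
# stealing driven by acceleration factor, allowed even when the stealer is not idle
pmtool -p platform_file.rec tasks.rec -a hetprio:rank=heft:ssbaf=yes:stealidle=no
```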
True Hetero Prio
There is another implementation of HeteroPrio called `truehp`, restricted to the case of 2 types of processors. It is faster when there are many types of tasks (and thus few tasks of each type), because it stores all tasks in one queue ordered by acceleration factor, whereas `hetprio` maintains one queue per type of task, ordered according to its `rank` option. `truehp` does not accept any option for now.
indep
Algorithms based on iteratively solving independent-task scheduling problems. They rely on a greedy algorithm which maintains a list of ready tasks allocated to each resource and schedules them in order of priority; the independent-task solver is called to assign tasks to resources. All of these only work with two types of resources.
Options
- `style` (default `onidle`): main option that controls how often the independent-task solver is called. Possible values:
  - `onidle`: the solver is called each time at least one resource is idle.
  - `onprogress`: the solver is called when a given proportion of tasks in the last batch have been scheduled. Additional option: `proportion` (default 0).
  - `strict`: the solver is called when all tasks from the last batch have been completed. The only style without a `rank` option; tasks are ordered by decreasing execution time on their selected resource. Additional option: `nosort` to not sort them.
  - `level`: the solver is called when all tasks from the last batch have been scheduled. Should be equivalent to `onprogress:proportion=1`.
- `indep` (default `dualhp`): selects which independent-task algorithm to use. Values:
  - `dp2`: dual approximation, dynamic-programming based algorithm. Based on the simpler algorithm from Scheduling Independent Tasks on Multi-cores with GPU Accelerators, described in Section 5.2. Additional option: `disc` selects the discretization precision (default 3.0).
  - `dualhp`: dual approximation, HeteroPrio-based greedy algorithm. Inspired from Scheduling Data Flow Program in XKaapi: A New Affinity Based Algorithm for Heterogeneous Architectures, with only the second part of the schedule.
  - `dp3demi`: dual approximation, dynamic programming based algorithm. Based on APPROX-3/2 from Scheduling Independent Moldable Tasks on Multi-Cores with GPUs, but restricted to the non-moldable case. Should also appear as a more generic (2q+1)/(2q) approximation in IJFCS.
  - `accel`: Accel algorithm from Scheduling Independent Tasks on Multi-cores with GPU Accelerators, Section 4.
  - `imreh`: algorithm MG(alpha, gamma) from page 6 of (DOI) 10.1007/s00607-003-0011-9.
  - `balest`: BalancedEstimate algorithm from this paper. Needs `dosort=no` to obtain exactly the same behavior as in the paper.
  - `balmks`: BalancedMakespan from the same paper as `balest`. Same comment.
  - `zhang`: online algorithm from Online Scheduling of Mixed Cpu-Gpu Jobs. Options: `lambda`, `beta`, `theta`, `phi` to change the parameters; `sortby` (can be `none`, `accel`, `avg`, `min`) and `rev` ("reverse", boolean) to select how to sort the available tasks.
  - `round`: rounds the solution of the linear program, as proposed in Scalable linear programming based resource allocation for makespan minimization in heterogeneous computing systems, (DOI) 10.1016/j.jpdc.2015.07.002. This algorithm sorts the tasks in LPT order; needs `dosort=no` to keep this ordering.
  - `clb2c`: implements the CLB2C strategy (see Considerations on distributed load balancing for fully heterogeneous machines: Two particular cases). Not exactly equivalent to this strategy, since `indep` performs list scheduling based only on the assignment to different types of resources.
  - `minmin`: implements the MinMin strategy (see A Comparison of Eleven Static Heuristics for Mapping a Class of Independent Tasks onto Heterogeneous Distributed Computing Systems).
- `rank` (except for style `strict`): as for the previous algorithms.
- `dosort` (except for `strict`, default `yes`): if `yes`, applies the rank option to sort tasks after calling the independent-task algorithm. If `no`, leaves them in the order chosen by the algorithm.
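A sketch combining these options (the discretization value is arbitrary, file names as before):
```
# solver invoked on idle with the default dualhp, plus a dp2 run with a finer discretization
pmtool -p platform_file.rec tasks.rec -a indep:style=onidle:indep=dualhp:rank=heft -a indep:indep=dp2:disc=1.5
```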
dmdas
Emulates the dmdas scheduler from StarPU. It considers ready tasks in priority order and assigns them to the resource on which they finish earliest, given the tasks already assigned. High-priority tasks can overtake already assigned low-priority tasks, but the resource selector does not know it.
Options
- `rank`
greedy
Greedy algorithm with pre-allocated tasks. Requires a `file` option.
Options
- `rank`: same as usual, with one additional possible value (which is the default):
  - `alloc`: bottom-level priority computation (similar to `min`, `heft` and `area`), where the task weight is its execution time on its allocated resource.
- `file` (required): provides the name of a file from which the pre-allocation will be read, one resource-type number per task, in the order of the instance. CAREFUL with the `-t` option, which renumbers resources.
- `key`: uses a pre-allocation shared with the `share` or `export` option. Often `KEY[alloc]` actually.
- `comms`: if `yes`, takes communications into account (if they are defined in the `tasks.rec` input file). Default is `no`.
- `stealer` (default -1): specifies a resource type which is allowed to steal from others, by number. CAREFUL with the `-t` option, which renumbers resources.
- `stealerName`: specifies a resource type which is allowed to steal, by name. Stealing only happens when idle, and steals the task with the best acceleration factor compared to the victim (highest priority in case of ties). If both `stealer` and `stealerName` are specified, `stealer` has precedence.
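A sketch using a pre-allocation stored in a plain file (the file name `alloc.txt` is a placeholder; the Typical usage section shows the `file=int@ival[alloc]` variant):
```
# one resource-type number per task in alloc.txt; tasks ranked by their allocated execution time
pmtool -p platform_file.rec tasks.rec -a greedy:file=alloc.txt:rank=alloc:stealer=1
```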
zhang
Online version of the indep `zhang` algorithm. Less costly in execution time. Shares a lot of the implementation with `greedy`.
Options
Same options as `greedy`, except that it does not have a `file` option and cannot use the `alloc` ranking scheme. Additional options: `lambda`, `beta`, `theta` and `phi`, which allow changing the values of the parameters of the algorithm. Defaults are as in the paper.
lp
Algorithm only meant for independent tasks, which solves a basic ILP to get an optimal assignment of tasks to machines.
Options
- `limit`: specifies a time limit in seconds.
- `gap`: specifies a gap limit between the current solution and the upper bound.
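A sketch (the time limit value is arbitrary):
```
# optimal assignment of independent tasks via an ILP, with a 60-second time limit
pmtool -p platform_file.rec tasks.rec -a lp:limit=60
```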
Bound description and options
Reminder: all bounds require compiling with CPLEX.
Common options and concepts
- option `verbosity`: same as before.
- option `share=KEY`: same as for algorithms; each bound can share some values, with the identifier `KEY[PARAM]`. Different bounds share different parameters.
- hybridization: almost all bounds are hybridized by default, in the following sense: some tasks are taken away from the instance, at the beginning and at the end of the DAG. The shortest maximal path of each of these sub-DAGs is computed, and their lengths are added to the result of the bound on the remaining DAG. This provides a bound which can be stronger than the bound on the whole DAG. Tasks are taken away one after the other, until this no longer improves the result. Hybridized bounds have the following options:
  - `hybrid` (default `yes`): can be set to `no` to disable hybridization.
  - `limit` (default 0): if non-zero, once the sets of tasks to take away are decided, the bound is computed in "tryhard" mode (usually looking for an integer solution) with the given time limit.
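A sketch of playing with these options (the time limit is arbitrary; the `dep` bound is described below):
```
# dep bound without hybridization, then with a 30-second "tryhard" integer phase
pmtool -p platform_file.rec tasks.rec -b dep:hybrid=no -b dep:limit=30
```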
Possible bounds:
- `area`: simple area bound, with one variable per type of task and type of resource. Very quick.
- `cp`: simple critical-path bound. Very quick.
- `dep`: mixed area-dependency bound. Always better than `max(area, cp)`, but slower. Options:
  - `mode`: `normal` (default) or `concurrent`; if `concurrent`, uses CPLEX's concurrent solving mode.
  - `outint` (boolean, default `no`): if present and the time limit is non-zero, outputs the computed integer solution.
- `iterdep`: a refinement of the `dep` bound. Once the `dep` bound is computed, go around the graph searching for a pair of tasks (u, v) such that the total work of the tasks that must execute between `end(u)` and `start(v)` is too large to fit there. Add a local area bound between u and v, and solve again. Empirically, costly to compute and not much gain. Options: same as `dep`.
- `mixed`: based on the paper by J. Langou. Idea: compute, with the `dep` bound, the remaining time to the end for each task; then, for each task, compute a corresponding area bound. Return the best computed bound. Very slow, not strong enough to justify it.
- `interval`: interval bound. Based on dividing the schedule into intervals and writing area bounds for each interval. Intervals are made along a critical path, with cuts at the start and end of each task of the critical path. There is one variable for each task, in each interval, on each resource, with dependency constraints and area constraints in each interval. There are start variables (like in the `dep` bound), but they are only loosely linked to the intervals because this is hard to express. This is the only bound which is not hybridized, because it does not help. See the sketch after this list. Options:
  - `weight`: `min` (default) or `heft`; specifies how the critical path is computed for the intervals.
  - `pre` and `post` (default -1): if both are positive, only `pre` intervals are kept at the beginning and `post` at the end; the middle intervals are merged together. Avoids a lot of computation.
  - `paths` (boolean, default `yes`): if `yes`, adds more constraints about the time each path gets allocated in each interval, to make sure that the interval is long enough. Only a limited number of paths is used, ensuring that each task has a path from and to the critical path.
  - `share`: can share `[alloc]` (the interval in which the task has done most of its computation), `[med]` (the interval in which the task has been allocated half of its work), and `[avg]` (the average interval in which the task is performed).
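A minimal sketch for the interval bound (values are illustrative; the Typical usage section shows a fuller invocation with `share` and `limit`):
```
# interval bound with a heft-weighted critical path, keeping 3 intervals at each end
pmtool -p platform_file.rec tasks.rec -b interval:weight=heft:pre=3:post=3
```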
More specific options
Here are some more specific options to `pmtool`, not used very often:
- `--best=<file>`: output the best makespan obtained. If `file` is present, save the best schedule obtained for each instance in `file`.
- `--use-tags`: append the content of the `Tag` field of `instance.rec` input files to job ids.
- `--submit-order`: use the `SubmitOrder` field of `instance.rec` input files instead of `JobId`.
- `--export-type`: when exporting in `.rec` format with the `export=` option of algorithms, specify all workers of this type instead of the particular worker.
- `--export-order`: when exporting in `.rec` format with the `export=` option of algorithms, specify the ordering of tasks on workers in addition to the allocation to workers. Not compatible with `--export-type`.
- `-r <repartFile>`: output a summary of the repartition of task types on resource types in `repartFile`. If there are only two types of resources, the standard output also indicates CPU and GPU usage.
- `-w`, `--output-raw`: raw algorithm names, instead of R-friendly column output.
- `-t <tol>`, `--tolerance=<tol>`: tolerance for merging resources into the same type. If all execution times of two resources are within a factor (1+tol) of each other, these resources are merged and the average execution times are used.
- `--no-header`: do not output a header.
- `--no-dep`: ignore dependencies in input files.
- `--rev-dep`: reverse dependencies.
- `--no-convert`: do not convert indices in the input. To be used if task IDs count from 0 to (n-1) instead of the default 1 to n.