# Post-Execution Dataset for Characterizing Straggler Detection Mechanisms

## Description

This is an example dataset that lets users try out the metrics we provide for characterizing straggler detection mechanisms.

## Methodology

* Testbed: 21 nodes on the Grid'5000 testbed, a scientific testbed that supports large-scale experiments on highly configurable infrastructure located across 10 sites in France.
* Platform: Hadoop 2.7.3 was used to run our experiments. We configured Hadoop with one dedicated node as the Resource Manager, which hosts the NameNode and the Application Manager processes. The remaining 20 nodes each ran one DataNode process and one Node Manager process. Each node hosting a Node Manager process was configured to run at most 8 Map tasks and 8 Reduce tasks at a time.
* Application: We chose WordCount, a simple yet representative MapReduce application, from the 13 applications provided by the Puma benchmark.
* Detection mechanisms: We examined three straggler detection mechanisms in our experiments: Default, LATE and Hierarchical.
* Environment heterogeneity: In addition to the homogeneous environment, we tuned the hardware settings to create a heterogeneous environment. We divided the 20 workers of our cluster into four groups: G1, G2, G3 and G4. Each group consists of a specific number of nodes, and every node in group Gi has i active cores. For instance, all nodes in group G3 have 3 active cores. We varied the ratio of the four groups to produce different scenarios covering a broad range of possible heterogeneous cluster settings. This dataset includes four settings: {35-35-5-25}, {25-25-25-25}, {10-10-5-75} and {5-5-0-90}.
* The data for each scenario were collected over 5 to 10 repeated runs.

## Dataset format

* homo.txt: tasks' execution times in the homogeneous environment.
    * Format: Col1-Runtime Col2-TaskID Col3-Execution Time
    * Col1-Runtime: The index i of the run, i.e. the ith time the experiment was run.
    * Col2-TaskID: The ID of the task.
    * Col3-Execution Time: The execution time of the corresponding task (in milliseconds).
* scenario-a-b-c-d.txt: tasks' execution times in the heterogeneous environment setting {a-b-c-d}.
    * Format: Col1-Runtime Col2-TaskID Col3-Execution Time Col4-Is detected Col5-Detection Strategy Col6-Detection Time
    * Col1-Runtime: The index i of the run, i.e. the ith time the experiment was run.
    * Col2-TaskID: The ID of the task.
    * Col3-Execution Time: The execution time of the corresponding task (in milliseconds).
    * Col4-Is detected: Whether the corresponding task was detected as a straggler.
    * Col5-Detection Strategy: The detection strategy used.
    * Col6-Detection Time: The time elapsed from the moment the task starts until it is detected as a straggler.

## Plotting scripts

We provide several gnuplot scripts to make plotting the graphs easier.

## Data Analyser

The R code in the **./R/Analyser.r** file helps extract useful information from the provided dataset. Its outputs are *Precision*, *Recall*, *Response time*, *Undetected time* and *Fake positive*.
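
For readers who want to inspect a scenario file without R, the following is a minimal Python sketch that reads the six-column format described above and computes precision and recall per detection strategy. It is not a port of **./R/Analyser.r**. Several points are assumptions, not facts about the dataset: columns are taken to be whitespace-separated, the "Is detected" column is taken to hold a value such as `1` or `true`, and a task is treated as a "true" straggler when it runs longer than 1.5 times the median task of the same run and strategy. The function names and the threshold are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical ground-truth rule (an assumption, not taken from the dataset):
# a task is a "true" straggler if it runs longer than this factor times the
# median execution time of its run.
STRAGGLER_FACTOR = 1.5


def parse_scenario(lines):
    """Parse rows of a scenario-a-b-c-d.txt file.

    Assumes whitespace-separated columns in the documented order. Rows that
    do not have six columns or a numeric execution time are skipped.
    """
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) < 6:
            continue
        try:
            exec_ms = float(parts[2])
        except ValueError:
            continue  # e.g. a header row
        rows.append({
            "run": parts[0],
            "task_id": parts[1],
            "exec_ms": exec_ms,
            # Assumed encoding of the "Is detected" column.
            "detected": parts[3].lower() in {"1", "true", "yes"},
            "strategy": parts[4],
            "detect_time": parts[5],  # kept as text; encoding not assumed
        })
    return rows


def precision_recall(rows, factor=STRAGGLER_FACTOR):
    """Return {strategy: {"precision": ..., "recall": ...}}."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["strategy"], row["run"])].append(row)

    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for (strategy, _run), tasks in groups.items():
        c = counts[strategy]
        cutoff = factor * median(t["exec_ms"] for t in tasks)
        for t in tasks:
            actual = t["exec_ms"] > cutoff
            if t["detected"] and actual:
                c["tp"] += 1
            elif t["detected"]:
                c["fp"] += 1
            elif actual:
                c["fn"] += 1

    result = {}
    for strategy, c in counts.items():
        flagged = c["tp"] + c["fp"]
        stragglers = c["tp"] + c["fn"]
        result[strategy] = {
            # None marks an undefined ratio (nothing flagged / no stragglers).
            "precision": c["tp"] / flagged if flagged else None,
            "recall": c["tp"] / stragglers if stragglers else None,
        }
    return result


if __name__ == "__main__":
    # The file name follows the scenario-a-b-c-d.txt pattern described above.
    with open("scenario-25-25-25-25.txt") as handle:
        print(precision_recall(parse_scenario(handle)))
```

Grouping by strategy and run keeps the median from mixing tasks of different repetitions. If the files use a different separator, a different encoding for "Is detected", or a different definition of a straggler, adjust `parse_scenario` and `STRAGGLER_FACTOR` accordingly.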