Segfault while running the pt-scotch 7.0.0 test suite
Hello,
The dgord_1 test is segfaulting for me while running the pt-scotch 7.0.0 test suite, using Open MPI on a single node:
Start 123: prg_full_3
123/134 Test #123: prg_full_3 ................................ Passed 0.01 sec
Start 124: dgord_1
124/134 Test #124: dgord_1 ...................................***Failed 2.39 sec
[localhost:7661 :0:7669] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[localhost:7660 :0:7670] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[localhost:07660] *** Process received signal ***
[localhost:07660] Signal: Segmentation fault (11)
[localhost:07660] Signal code: Address not mapped (1)
[localhost:07660] Failing at address: (nil)
[localhost:07660] [ 0] /gnu/store/5h2w4qi9hk1qzzgi1w83220ydslinr4s-glibc-2.33/lib/libpthread.so.0(+0x11d80)[0x7ffff7fb7d80]
[localhost:07660] *** End of error message ***
==== backtrace (tid: 7669) ====
0 /gnu/store/pjbpikwv5n0gnkgha39cdx5rdcm757ng-ucx-1.9.0/lib/libucs.so.0(ucs_handle_error+0x254) [0x7ffff5570ff4]
1 /gnu/store/pjbpikwv5n0gnkgha39cdx5rdcm757ng-ucx-1.9.0/lib/libucs.so.0(+0x251af) [0x7ffff55711af]
2 /gnu/store/pjbpikwv5n0gnkgha39cdx5rdcm757ng-ucx-1.9.0/lib/libucs.so.0(+0x25476) [0x7ffff5571476]
=================================
[localhost:07661] *** Process received signal ***
[localhost:07661] Signal: Segmentation fault (11)
[localhost:07661] Signal code: (-6)
[localhost:07661] Failing at address: 0x3e700001ded
[localhost:07661] [ 0] /gnu/store/5h2w4qi9hk1qzzgi1w83220ydslinr4s-glibc-2.33/lib/libpthread.so.0(+0x11d80)[0x7ffff7fb7d80]
[localhost:07661] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node localhost exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Start 125: dgord_2
Occasionally, the test hangs instead of crashing, apparently stuck in an MPI collective operation:
(gdb) bt
#0 0x00007ffff5e42cfa in ?? ()
from target:/gnu/store/9dqk7zm2ki57d27jn1dhqahjjr7xv7kz-openmpi-4.1.1/lib/openmpi/mca_btl_vader.so
#1 0x00007ffff79b97ec in opal_progress ()
from target:/gnu/store/9dqk7zm2ki57d27jn1dhqahjjr7xv7kz-openmpi-4.1.1/lib/libopen-pal.so.40
#2 0x00007ffff7d2e795 in ompi_request_default_wait ()
from target:/gnu/store/9dqk7zm2ki57d27jn1dhqahjjr7xv7kz-openmpi-4.1.1/lib/libmpi.so.40
#3 0x00007ffff7d8fc5a in ompi_coll_base_sendrecv_actual ()
from target:/gnu/store/9dqk7zm2ki57d27jn1dhqahjjr7xv7kz-openmpi-4.1.1/lib/libmpi.so.40
#4 0x00007ffff7d8dff0 in ompi_coll_base_allgather_intra_bruck ()
from target:/gnu/store/9dqk7zm2ki57d27jn1dhqahjjr7xv7kz-openmpi-4.1.1/lib/libmpi.so.40
#5 0x00007ffff53f1a8a in ompi_coll_tuned_allgather_intra_dec_fixed ()
from target:/gnu/store/9dqk7zm2ki57d27jn1dhqahjjr7xv7kz-openmpi-4.1.1/lib/openmpi/mca_coll_tuned.so
#6 0x00007ffff7d4131f in PMPI_Allgather ()
from target:/gnu/store/9dqk7zm2ki57d27jn1dhqahjjr7xv7kz-openmpi-4.1.1/lib/libmpi.so.40
#7 0x000000000042237b in _SCOTCHhdgraphInduceList (orggrafptr=0x7fffffffbc90, indlistnbr=<optimized out>,
indlisttab=0x625d5c, indgrafptr=indgrafptr@entry=0x7fffffffb6a0)
at /tmp/guix-build-pt-scotch-7.0.0.drv-0/source/src/libscotch/hdgraph_induce.c:188
#8 0x00000000004117ba in hdgraphOrderNdFold2 (fldthrdptr=fldthrdptr@entry=0x7fffffffb860)
at /tmp/guix-build-pt-scotch-7.0.0.drv-0/source/src/libscotch/hdgraph_order_nd.c:99
#9 0x0000000000412078 in hdgraphOrderNdFold3 (descptr=descptr@entry=0x7fffffffb7e0,
fldthrdtab=fldthrdtab@entry=0x7fffffffb860)
at /tmp/guix-build-pt-scotch-7.0.0.drv-0/source/src/libscotch/hdgraph_order_nd.c:131
#10 0x0000000000409b12 in _SCOTCHthreadLaunch (contptr=0x5e0e50,
funcptr=funcptr@entry=0x412050 <hdgraphOrderNdFold3>, paraptr=paraptr@entry=0x7fffffffb860)
at /tmp/guix-build-pt-scotch-7.0.0.drv-0/source/src/libscotch/common_thread.c:363
#11 0x0000000000411c94 in hdgraphOrderNdFold (fldgrafptr=0x7fffffffb9e0, indlisttab1=<optimized out>,
indlistnbr1=<optimized out>, indlisttab0=<optimized out>, indlistnbr0=<optimized out>, orggrafptr=0x7fffffffbc90)
at /tmp/guix-build-pt-scotch-7.0.0.drv-0/source/src/libscotch/hdgraph_order_nd.c:211
#12 _SCOTCHhdgraphOrderNd2 (grafptr=grafptr@entry=0x7fffffffbc90, cblkptr=cblkptr@entry=0x625be0,
paraptr=paraptr@entry=0x5e1908)
at /tmp/guix-build-pt-scotch-7.0.0.drv-0/source/src/libscotch/hdgraph_order_nd.c:395
#13 0x00000000004120de in _SCOTCHhdgraphOrderNd (grafptr=0x7fffffffbde0, cblkptr=0x625be0, paraptr=0x5e1908)
at /tmp/guix-build-pt-scotch-7.0.0.drv-0/source/src/libscotch/hdgraph_order_nd.c:473
#14 0x000000000040bd07 in SCOTCH_dgraphOrderComputeList (libgrafptr=libgrafptr@entry=0x7fffffffc0d0,
libordeptr=libordeptr@entry=0x7fffffffbfb0, listnbr=listnbr@entry=0, listtab=listtab@entry=0x0,
straptr=straptr@entry=0x7fffffffbf78)
at /tmp/guix-build-pt-scotch-7.0.0.drv-0/source/src/libscotch/library_dgraph_order.c:224
#15 0x000000000040bddc in SCOTCH_dgraphOrderCompute (grafptr=grafptr@entry=0x7fffffffc0d0,
ordeptr=ordeptr@entry=0x7fffffffbfb0, straptr=straptr@entry=0x7fffffffbf78)
at /tmp/guix-build-pt-scotch-7.0.0.drv-0/source/src/libscotch/library_dgraph_order.c:148
#16 0x0000000000403952 in main (argc=<optimized out>, argv=<optimized out>)
at /tmp/guix-build-pt-scotch-7.0.0.drv-0/source/src/scotch/dgord.c:334
This is with a "standard" Open MPI build. Since the backtrace shows the MPI_Allgather being issued from a thread spawned via _SCOTCHthreadLaunch, should I instead use an Open MPI build configured with --enable-mpi-thread-multiple? (This was apparently unnecessary with pt-scotch 6.x.)
Thanks in advance. :-)