11.9.3 Parallel execution in ABAQUS/Explicit

Products: ABAQUS/Explicit ABAQUS/CAE

Overview

Parallel execution in ABAQUS/Explicit:

reduces run time for analyses that require a large number of increments;
reduces run time for analyses that contain a large number of nodes and elements;
produces analysis results that are independent of the number of processors used for the analysis;
is available for shared memory computers using a thread-based loop-level or thread-based domain decomposition implementation; and
is available for both shared memory computers and computer clusters using an MPI-based domain decomposition parallel implementation.

Invoking parallel processing

Parallelization in ABAQUS/Explicit is implemented in two ways: domain level and loop level. The domain-level method breaks the model up into topological domains and assigns each domain to a processor. The domain-level method is the default. The loop-level method parallelizes low-level loops that are responsible for most of the computational cost. The element, node, and contact pair operations account for the majority of the low-level parallelized routines.

Parallelization can be invoked by specifying the number of processors to be used.

Input File Usage: Enter the following input on the command line:

abaqus job=job-name cpus=n

For example, the following input will run the job “beam” on two processors:

abaqus job=beam cpus=2

ABAQUS/CAE Usage:

Job module: job editor: Parallelization: toggle on Use multiple processors, and specify the number of processors, n

Domain-level parallelization

The domain-level method splits the model into a number of topological domains. These domains are referred to as parallel domains to distinguish them from other domains associated with the analysis. The domains are distributed evenly among the available processors. The analysis is then carried out independently in each domain. However, information must be passed between the domains in each increment because the domains share common boundaries. Both MPI and thread-based parallelization modes are supported with the domain-level method.

The domain-level method divides the model so that the resulting domains take approximately the same amount of computational expense. The load balance is defined as the ratio of the computational expense of the most expensive domain to that of the least expensive domain.
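
The load balance definition above can be illustrated with a short sketch. The per-domain cost values and the function name load_balance are hypothetical and chosen only for illustration; ABAQUS/Explicit computes its own cost estimates internally.

```python
def load_balance(domain_costs):
    """Ratio of the most expensive domain's cost to the least expensive one's."""
    if not domain_costs:
        raise ValueError("at least one domain cost is required")
    if min(domain_costs) <= 0:
        raise ValueError("domain costs must be positive")
    return max(domain_costs) / min(domain_costs)


# Hypothetical cost estimates (arbitrary units) for four parallel domains.
costs = [10.0, 12.5, 11.0, 10.0]
print(f"load balance = {load_balance(costs):.2f}")
```

A value of 1.0 corresponds to domains of equal cost; larger values indicate that some processors will wait on the most expensive domain in each increment.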

Element and node sets are created for each domain and can be inspected in ABAQUS/CAE. The sets are named domain_n, where n is the domain number.

During the analysis, separate state (job-name.abq) and selected results (job-name.sel) files are created. There will be one state and one selected results file for each processor. The naming convention is to append the processor number to the file name. For example, the state files are named job-name.abq.n, where n is the processor number. At the completion of the analysis the individual files are merged automatically into a single file (for example, job-name.abq), and the individual files are deleted.

Input File Usage: Enter the following input on the command line:

abaqus job=job-name cpus=n parallel=domain domains=n_domains

For example, the following input will run the job “beam” on two processors with the domain-level parallelization method:

abaqus job=beam cpus=2 parallel=domain domains=2

The domain-level parallelization method can also be set in the environment file using the environment file parameters parallel=DOMAIN and domains.

ABAQUS/CAE Usage:

Job module: job editor: Parallelization: toggle on Use multiple processors, and specify the number of processors, n; Number of domains: n_domains; Parallelization method: Domain.
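
For scripted job submission, the command-line options described in this section can be assembled programmatically. The following is a minimal sketch; the helper name build_abaqus_command is hypothetical, and only the options documented in this section (job, cpus, parallel, and domains) are used.

```python
def build_abaqus_command(job, cpus, parallel="domain", domains=None):
    """Assemble the argument list for a parallel ABAQUS/Explicit run."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    if parallel not in ("domain", "loop"):
        raise ValueError("parallel must be 'domain' or 'loop'")
    args = ["abaqus", f"job={job}", f"cpus={cpus}", f"parallel={parallel}"]
    # The domains option is appended only when given explicitly;
    # otherwise the default described in this section applies.
    if domains is not None:
        if domains < 1:
            raise ValueError("domains must be at least 1")
        args.append(f"domains={domains}")
    return args


print(" ".join(build_abaqus_command("beam", 2, domains=2)))
```

The returned list could be passed to a process launcher such as subprocess.run; keeping the arguments as a list avoids shell quoting issues with job names.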

Consistency of results

The analysis results are independent of the number of processors used for the analysis. However, the results do depend on the number of parallel domains used during the domain decomposition. Except for cases in which the single- and multiple-domain models are different due to features that are not yet available with multiple parallel domains (discussed below), these differences should be triggered only by finite precision effects. For example, the order of the nodal force assembly may depend on the number of parallel domains, which can result in differences in trailing digits in the computed force. Some physical systems are highly sensitive to small perturbations, so a tiny difference in the force applied in one increment can result in noticeable differences in results in subsequent increments. Simulations involving buckling and other bifurcations tend to be sensitive to small perturbations.
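
The finite precision effect described above can be reproduced outside ABAQUS. The sketch below sums the same three hypothetical force contributions in two different orders, as might happen when a node's contributions are assembled under different domain decompositions; it illustrates floating-point behavior only and is not the assembly algorithm used by ABAQUS/Explicit.

```python
# Three hypothetical nodal force contributions (arbitrary units).
contributions = [0.1, 0.2, 0.3]

# Assembly order A: the first two contributions are combined first.
order_a = (contributions[0] + contributions[1]) + contributions[2]

# Assembly order B: the last two contributions are combined first.
order_b = contributions[0] + (contributions[1] + contributions[2])

print(order_a == order_b)        # False: the sums differ in the trailing digits
print(abs(order_a - order_b))    # difference on the order of machine precision
```

In a system that is sensitive to small perturbations, a difference of this size in one increment can grow over subsequent increments.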

To obtain consistent analysis results from run to run, the number of domains used in the domain decomposition should be constant. Increasing the number of domains increases the computational cost slightly; therefore, it is recommended that the number of domains be set equal to the maximum number of processors used for analysis execution for optimal performance. If you do not specify the number of domains, the number defaults to the number of processors.

Features that do not allow domain-level parallelization

The use of the domain-level parallelization method is not allowed with the following features:

Extreme value output.
Steady-state detection.
Reading temperatures or field variables as predefined fields from a user-specified results file (see “Predefined fields,” Section 27.6.1). (Reading initial temperatures or field variables from a user-specified results file is allowed; see “Initial conditions,” Section 27.2.1.)

If these features are included, an error message will be issued.

Features that cannot be split across domains

Certain features cannot be split across domains. The domain decomposition algorithm automatically takes this into account and forces these features to be contained entirely within one domain. If fewer domains than requested processors are created, ABAQUS/Explicit issues an error message. Even if the algorithm succeeds in creating the requested number of domains, the load may be balanced unevenly. If this behavior is not acceptable, the job should be run with the loop-level parallelization method.

Adaptive smoothing domains cannot span parallel domain boundaries. Similarly, adaptive nodes on the surface of the adaptive smoothing domain will not be shared with another parallel domain. To enforce this in a consistent manner when parallel domains are specified, all nodes shared by adjacent adaptive smoothing domains will be set as nonadaptive. In this case the analysis results may be significantly different from those of a serial run with no parallel domains. If this behavior is undesirable, set the number of parallel domains to 1 and switch to the loop-level parallelization method. See “Defining ALE adaptive mesh domains in ABAQUS/Explicit,” Section 12.2.2, for details.

A contact pair cannot be split across parallel domains, but separate contact pairs are not restricted to be in the same parallel domain. A contact pair that uses the kinematic contact algorithm requires that all of the nodes associated with the involved surfaces be within a single parallel domain and not be shared with any other parallel domains. A contact pair that uses the penalty contact algorithm requires that the associated nodes be part of a single parallel domain, but these nodes may also be part of other parallel domains. Analyses in which a large percentage of nodes are involved in contact may not scale well if contact pairs are used, especially with kinematic enforcement of contact constraints. General contact does not limit the domain decomposition boundaries.

Nodes involved in kinematic constraints (“Kinematic constraints: overview,” Section 28.1.1) will be within a single parallel domain, and they will not be shared with another parallel domain. However, two kinematic constraints that do not share nodes can be placed within different parallel domains.

In some cases beam elements that share a node may be forced into the same parallel domain. This happens only for beams whose center of mass does not coincide with the location of the beam node or for beams with additional inertia (see “Adding inertia to the beam section behavior for Timoshenko beams” in “Beam section behavior,” Section 23.3.5).

Restart

There are certain restrictions for restart when using domain-level parallelization. To ensure that optimal parallel speedup is achieved, the number of processors used for the restart analysis must be chosen so that the number of parallel domains used during the original analysis can be distributed evenly among the processors. Because the domain decomposition is based only on the features specified in the original analysis and the steps defined therein, a feature that affects domain decomposition cannot be defined in a restart step if it would invalidate the original domain decomposition. Features that are newly added in restart steps are assigned to the existing domains, so there is a potential for load imbalance and a corresponding degradation of parallel performance.
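
The requirement that the original parallel domains be distributed evenly among the restart processors amounts to choosing a processor count that divides the number of domains. A minimal sketch follows; the function name valid_restart_cpu_counts and its max_cpus argument are hypothetical.

```python
def valid_restart_cpu_counts(n_domains, max_cpus=None):
    """List processor counts that divide the original number of domains evenly."""
    if n_domains < 1:
        raise ValueError("n_domains must be at least 1")
    limit = n_domains if max_cpus is None else min(n_domains, max_cpus)
    return [cpus for cpus in range(1, limit + 1) if n_domains % cpus == 0]


# An original analysis decomposed into 8 parallel domains.
print(valid_restart_cpu_counts(8))               # [1, 2, 4, 8]
# The same check for 12 domains when at most 6 processors are available.
print(valid_restart_cpu_counts(12, max_cpus=6))  # [1, 2, 3, 4, 6]
```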

The restart analysis requires that the separate state and selected results files created during the original analysis be converted into single files, as described in “Execution procedure for ABAQUS/Standard and ABAQUS/Explicit,” Section 3.2.2. This should be done automatically at the conclusion of the original analysis. If the original analysis fails to complete successfully, you must convert the state and selected results files prior to restart. An ABAQUS/Explicit analysis packaged to run with a domain-level parallelization technique cannot be restarted or continued with a loop-level parallelization technique.

Loop-level parallelization

The loop-level method parallelizes low-level loops in the code that are responsible for most of the computational cost. There are no restrictions when running with this method; however, the speedup factor may be significantly less than what can be achieved with domain-level parallelization. The speedup factor will vary depending on the features included in the analysis since not all features utilize parallel loops; examples of features that do not are the general contact algorithm and kinematic constraints. The loop-level method may scale poorly for more than four processors depending on the analysis. Using multiple parallel domains with this method will degrade parallel performance and, hence, is not recommended.

Analysis results for this method do not depend on the number of processors used.

Input File Usage: Enter the following input on the command line:

abaqus job=job-name cpus=n parallel=loop

The loop-level parallelization method can also be set in the environment file using the environment file parameter parallel=LOOP.

ABAQUS/CAE Usage:

Job module: job editor: Parallelization: toggle on Use multiple processors, and specify the number of processors, n; Parallelization method: Loop

Restart

There are no restrictions on features that can be included in steps defined in a restart analysis when using loop-level parallelization. For performance reasons the number of processors used when restarting must be a factor of the number of processors used in the original analysis. The most common case would be restarting with the same number of processors as used in the original analysis. An ABAQUS/Explicit analysis packaged to run with a loop-level parallelization technique cannot be restarted or continued with a domain-level parallelization technique.

Measuring parallel performance

Parallel performance is measured by comparing the total time required to run on a single processor (serial run) to the total time required to run on multiple processors (parallel run). This ratio is referred to as the speedup factor. The speedup factor will equal the number of processors used for the parallel run in the case of perfect parallelization. Scaling refers to the behavior of the speedup factor as the number of processors is increased. Perfect scaling indicates that the speedup factor increases linearly with the number of processors. For both parallelization methods the speedup factors and scaling behavior are heavily problem dependent. In general, the domain-level method will scale to a larger number of processors and offer the higher speedup factor.
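
The speedup factor and scaling behavior can be tabulated from measured run times. The sketch below uses hypothetical wall-clock timings; the "fraction of ideal" column (speedup divided by the number of processors) is added here as a convenience for judging scaling and is not a quantity reported by ABAQUS.

```python
def speedup_factor(serial_time, parallel_time):
    """Ratio of the serial run time to the parallel run time."""
    if serial_time <= 0 or parallel_time <= 0:
        raise ValueError("run times must be positive")
    return serial_time / parallel_time


# Hypothetical total run times in seconds, keyed by number of processors.
timings = {1: 3600.0, 2: 1900.0, 4: 1000.0, 8: 600.0}
serial = timings[1]

for cpus, run_time in sorted(timings.items()):
    speedup = speedup_factor(serial, run_time)
    # With perfect parallelization the speedup equals cpus (fraction of ideal 1.0).
    print(f"cpus={cpus}: speedup={speedup:.2f}, fraction of ideal={speedup / cpus:.2f}")
```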

Use with user subroutines

User subroutines can be used when running jobs in parallel. However, user subroutines and any subroutines called by them must be thread safe. This precludes the use of common blocks, data statements, and save statements. Calling subroutines that are not thread safe will result in unpredictable behavior of the executable.

Output

There are no output restrictions. However, for the domain-level parallelization method, results are written to a separate file for each processor. The individual files are merged automatically into a single file at the completion of the analysis, and the individual files are deleted. If the analysis does not complete successfully, you must convert the selected results and output database files prior to postprocessing (see “Execution procedure for ABAQUS/Standard and ABAQUS/Explicit,” Section 3.2.2).

Although the individual output database files can be postprocessed in ABAQUS/CAE, only results associated with domains on that processor will be available. See “Execution procedure for ABAQUS/Standard and ABAQUS/Explicit,” Section 3.2.2, for instructions on converting results prior to analysis completion.