Information on JFORCES Checkpoint / Hotstart Capability

What is the Checkpoint / Hotstart Capability?

The JFORCES Checkpoint / Hotstart capability lets the user capture a snapshot image of a running simulation (a.k.a. a checkpoint) that can later be restarted from the saved point. Restarting from this mid-run condition is called hotstarting. This lets the user restart the scenario from the same point without waiting for the simulation to execute up to it, both saving time and avoiding rerun errors. Typical uses for this functionality include:

The current method of invoking checkpointing is based upon the Berkeley Lab Checkpoint/Restart (BLCR) capability. Since this method changes the kernel it is NOT automatically installed during JFORCES installation.  Each user is required to determine (in conjunction with appropriate IT and/or security personnel) whether it's acceptable to load BLCR on the system.  All other JFORCES elements will work without checkpointing.  For more information about BLCR go to http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml.

Once installed on a JFORCES system, BLCR is not limited to checkpointing the simulation; it can be used to checkpoint most processes. For this description we'll focus on checkpointing the simulation, since that is the typical use.

What are the Limitations?

The current implementation of checkpoint is based on the BLCR functionality, which modifies the Linux kernel to incorporate checkpoint capabilities. This means the implementation will not work on non-Linux machines (SUN, SGI or MS Windows). What's more, this code is specific to the Linux kernel. Currently the only working version of checkpoint works on the Fedora Core 2 and 4 releases.  In addition, BLCR is very sensitive to open channels, including TCP/IP and UDP/IP connections and files.  Today, JFORCES handles all of the following within checkpointing:

Programming work will be required to close and reopen any other channels at checkpoint and restart.  Programmers should put changes in the Do_Checkpoint function of sim/exec/checkpoint.c.

Initial testing indicates that the checkpoint file can only be restarted on the same computer that the checkpoint was created on.

The hotstart will resume in exactly the same way that it was running when checkpointed with the following exceptions:

The data collection file is restored to the state it was in when the checkpoint occurred.

The restarting execution uses the same communications ports that were initially used; using alternate communications ports is not an option.

Since the checkpoint is an exact image of the simulation at a given time, if the scenario is already running towards an unacceptable situation (an error or scenario faux pas), the checkpoint will also run to the same conclusion unless the result can be modified by either the user or a programmer (i.e. dbx insertion) between the time of the checkpoint and the undesired event.


How is it Invoked?

There are five answers to this, depending on whether you want to checkpoint or restart the simulation, and on whether you do so manually, at a prescripted point, or via the GUI. This 3x2 matrix results in five viable options because restarting in a prescripted mode is not an option.

Runtime GUI Interfaces

Checkpointing – This is invoked from the MMI menu during a simulation execution by selecting “System Controls->Checkpoint”. No options are provided. The status of the checkpoint is displayed in the light blue info window on the main controls. I recognize that the messages might be too large for that window, but if you think you need to see the full text it's available for review by selecting “System Controls->Review Info Log”.

Note that there's often a significant (10+ seconds) simulation execution delay during a checkpoint. The checkpoint file will be stored in the $HOTSTART_DIR directory (usually /data/hotstarts) under a directory named according to your scenario (e.g. Fort_Hood) with a name based on the unix time the scenario was saved. You're permitted to make multiple checkpoints during the same run.
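To see what has been saved for a scenario, you can list that directory and translate each name back into a date. The sketch below is only an illustration: it assumes the file name begins with the Unix time digits (the exact naming, including any extension, may differ on your installation) and that GNU date is available. The helper name list_checkpoints is hypothetical and not part of JFORCES.

```shell
#!/bin/sh
# Hypothetical helper: list the checkpoints of one scenario with readable dates.
# Assumption: each file name starts with the Unix time of the checkpoint.
list_checkpoints() {
    dir="${HOTSTART_DIR:-/data/hotstarts}/$1"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        # Keep only the leading digits of the name (the Unix time).
        stamp=$(printf '%s\n' "$name" | sed 's/^\([0-9]*\).*/\1/')
        if [ -n "$stamp" ]; then
            # "-d @seconds" is a GNU date feature.
            when=$(date -u -d "@$stamp" '+%Y-%m-%d %H:%M:%S UTC')
        else
            when="unknown time"
        fi
        printf '%s  %s\n' "$name" "$when"
    done
}
```

For example, "list_checkpoints Fort_Hood" prints one line per saved checkpoint of the Fort_Hood scenario.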

Restarting – To restart from one of the checkpoints found under the HOTSTART_DIR directory (usually /data/hotstarts), start the JFORCES interface and select “System Controls->Start Scenario”. Then select “Run Configuration” and click the button labeled “Change to Hotstart Mode” at the bottom of the displayed list. Then select the appropriate hotstart file from the displayed list. Note that when you do, many of the original options on the “Start Scenario” widget will disappear. This is because they're no longer options but are instead predetermined by the checkpoint itself. All hotstarts begin in interactive mode. Just click Execute to start the hotstart.

Manual Invocation

Checkpointing – Find the process ID of the process to be checkpointed and then issue the following command:

cr_checkpoint -f (checkpoint destination filename) (process id to be checkpointed)

Note that you're not restricted to checkpointing only the simulation in manual mode; many other processes can be checkpointed, though those using comm channels (e.g. the forces interface) or file output cannot be restarted reliably.
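The manual checkpoint steps might be wrapped up as in the following sketch. Only the "cr_checkpoint -f (file) (pid)" form comes from the description above; the helper name checkpoint_process, the DRY_RUN switch and the choice to name the file after the current Unix time are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical wrapper around the manual checkpoint command.
# Usage: checkpoint_process <pid> <destination directory>
# Set DRY_RUN=1 to print the command instead of running it.
checkpoint_process() {
    pid=$1
    dest_dir=$2
    # Stop if the process cannot be signalled (it does not exist,
    # or it belongs to another user).
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "cannot checkpoint process: $pid" >&2
        return 1
    fi
    # Name the file after the current Unix time (an illustrative choice).
    dest="$dest_dir/$(date +%s)"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "cr_checkpoint -f $dest $pid"
    else
        mkdir -p "$dest_dir" && cr_checkpoint -f "$dest" "$pid"
    fi
}
```

Find the process ID first (for example with ps), then call "checkpoint_process (pid) (directory)". Running it once with DRY_RUN=1 shows the command that would be issued.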

Restarting – To restart manually from a checkpoint created manually, just type the checkpoint filename in a command window. If you're restarting the simulation you'll also need to put the data collection file (if there is one) in its original location, and you must create an empty file named /tmp/.resume_from_checkpoint. On non-simulation processes you're on your own restoring file output channels and interprocess communications channels.
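As a rough sketch, the manual restart of the simulation could be scripted as follows. The helper name restart_from_checkpoint and the RESUME_MARKER override are hypothetical; the marker file path and the "run the checkpoint file directly" step are taken from the description above.

```shell
#!/bin/sh
# Hypothetical helper for a manual restart of the simulation.
# Usage: restart_from_checkpoint <checkpoint file> [<saved data file> <original data path>]
# RESUME_MARKER can override the marker path; it defaults to the one named in the text.
restart_from_checkpoint() {
    ckpt=$1
    saved_data=${2:-}
    data_dest=${3:-}
    marker=${RESUME_MARKER:-/tmp/.resume_from_checkpoint}

    [ -f "$ckpt" ] || { echo "checkpoint not found: $ckpt" >&2; return 1; }

    # Put the data collection file back in its original location, if given.
    if [ -n "$saved_data" ] && [ -n "$data_dest" ]; then
        cp "$saved_data" "$data_dest" || return 1
    fi

    # Create the empty marker file required for a simulation restart.
    : > "$marker" || return 1

    # A bare file name would be searched for on PATH, so make it a path.
    case $ckpt in
        */*) ;;
        *) ckpt=./$ckpt ;;
    esac

    # The text says to restart by typing the checkpoint file name,
    # so the file is run directly here.
    "$ckpt"
}
```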

Prescripted Interfaces

You can prescript a checkpoint to occur during a run via the standard JFORCES interface. Go to “System Controls->DBA->Plan Run->Run Configuration->Maintain Run Configuration” and select the run you wish to checkpoint. When the “Editing Plan Run” widget appears, select “Replay Configuration Name” and then select “Replay Current Choice” from the presented list. Select “Preselect Time for Checkpoint” from the menu and specify the time.

Special Notes for the Prescripted Interface to Checkpointing:

  1. You might want to create a new replay configuration via the Run Configuration interface instead of editing the current replay configuration if your standard replay configuration is used by others (e.g. readdb).

  2. Remember that after specifying that a checkpoint should be saved it will be saved EACH TIME you subsequently execute the run. This can result in a lot of files being saved over time. Typically you'll want to set up the run to save the execution once and then go back into the run configuration and turn this option off. If you find you've saved a lot of checkpoints you don't want you can delete them via the “system controls->DBA->Checkpoint Maintenance” interface.

Other Maintenance Information

Over time you'll want to delete old checkpoints. These can be removed via the “system controls->DBA->Checkpoint Maintenance” interface.
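Before opening the maintenance interface it can help to know which checkpoint files are old. The sketch below only lists candidates and deletes nothing; the helper name old_checkpoints is hypothetical, and it assumes the checkpoints live under $HOTSTART_DIR as described earlier.

```shell
#!/bin/sh
# Hypothetical helper: list checkpoint files older than a number of days.
# Usage: old_checkpoints <days>
old_checkpoints() {
    days=$1
    root=${HOTSTART_DIR:-/data/hotstarts}
    # -mtime +N matches files whose age in whole days is greater than N.
    find "$root" -type f -mtime +"$days" -print | sort
}
```

For example, "old_checkpoints 30" prints the files last modified more than 30 days ago, one per line.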

Information on Storage Structure

When saved in manual mode a checkpoint is simply a binary file containing enough information (excepting file handles and comm sockets) to restart from a given point. But the files stored from a JFORCES checkpoint have a bit more information. These files are actually a compressed tar of three files, namely:

  1. The checkpoint file itself

  2. The data collection file as it existed at the time of checkpointing (so data collection can be reliably resumed from the same point during hotstarts)

  3. Any replay-related files.  This includes files being saved from the checkpointed execution and input replay files to the checkpointed application.  Note that both will be restored so it's possible to save and reuse replay files from checkpointed executions.

These compressed tars are saved in the $HOTSTART_DIR directory (as set up by your login script, but typically /data/hotstarts). They are saved in subdirectories named after the scenario, in files named according to the Unix time the checkpoint was made.
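If you want to look inside one of these archives, something like the following sketch may help. The helper names are hypothetical. Recent versions of GNU tar detect the compression type themselves when reading an archive; older versions may need an explicit option such as -z, depending on how the file was compressed.

```shell
#!/bin/sh
# Hypothetical helpers for looking inside a JFORCES checkpoint archive.

# List the members of the archive without extracting anything.
checkpoint_contents() {
    tar -tf "$1"
}

# Unpack the archive into a fresh scratch directory for inspection
# and print the name of that directory.
unpack_checkpoint() {
    scratch=$(mktemp -d) || return 1
    tar -xf "$1" -C "$scratch" || return 1
    echo "$scratch"
}
```

Unpacking into a scratch directory leaves the original archive untouched, so the GUI hotstart can still use it.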

Installation Information

Before describing installation, I should mention that checkpoint/hotstart capabilities are NOT installed automatically.  This is because it makes non-standard (although simple) changes to the Linux kernel, which caused some security officers concern.  So you're required to install it yourself in the hope you first clear it with security and IT representatives.

The good news is that this functionality is easy to install.

To install, get the Berkeley Lab Checkpoint/Restart application.  Today you can find this at http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml.  Version 0.4.2 is the version tested with Fedora Core 2 & 4.   The standard install works through compilation.  Namely:

  1. Log in as root.
  2. Expand the tarball (e.g. cd /tmp; tar -xzf blcr-0.4.2.tgz)
  3. Move into the expanded source directory (e.g. cd /tmp/blcr-0.4.2)
  4. Configure, make and install the code (e.g. ./configure --prefix=/usr; make; make install)
  5. Delete the working files (e.g. cd; rm -rf /tmp/blcr*)

Now that it's installed you need to load the modules.  Today this is done by creating an executable file named /etc/rc.d/rc5.d/S97blcr with the following lines (with the kernel version and library path changed as appropriate to your system; the library path depends on the --prefix given to configure):

# blcr depends on blcr_vmadump (see the lsmod output), so load blcr last
/sbin/insmod /usr/local/lib/blcr/2.6.11-1.1369_FC4/blcr_imports.ko
/sbin/insmod /usr/local/lib/blcr/2.6.11-1.1369_FC4/blcr_vmadump.ko
/sbin/insmod /usr/local/lib/blcr/2.6.11-1.1369_FC4/blcr.ko

Save this file, make sure the permissions are set to permit execution, and test by typing the following:

/etc/rc.d/rc5.d/S97blcr

Fix any problems (again, look at the BLCR website for any installation changes).  BTW - ignore any "file exists" errors; these just mean these modules have already been loaded into the kernel.

If there are no obvious problems, verify the modules have been loaded into the kernel by issuing the following command:

/sbin/lsmod | grep blcr

You should see something like:

thunder:paul> /sbin/lsmod | grep blcr
blcr 56460 0
blcr_vmadump 25636 1 blcr
blcr_imports 6528 2 blcr,blcr_vmadump
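If you'd rather not eyeball the listing, the check can be scripted. The sketch below reads lsmod output on standard input and reports whether the three modules shown above are present; the helper name check_blcr_modules is hypothetical, and the module names are the ones from this version of BLCR.

```shell
#!/bin/sh
# Hypothetical check: read lsmod output on standard input and report
# whether the three BLCR modules named above are present.
check_blcr_modules() {
    listing=$(cat)
    missing=""
    for mod in blcr_imports blcr_vmadump blcr; do
        # The module name is the first column of each lsmod line.
        if ! printf '%s\n' "$listing" |
            awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }'; then
            missing="$missing $mod"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "all BLCR modules loaded"
}
```

Use it as "/sbin/lsmod | check_blcr_modules". Matching on the first column avoids counting a module that only appears in another module's "used by" list.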

You'll also want to make sure the JFORCES user can access the utilities.  A simple test for this is to log in as the JFORCES user and type:
cr_checkpoint (carriage return)

If you get a usage statement you know that the JFORCES user can use the checkpoint/restart capability.
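The same check can be made without reading the usage text, by asking the shell whether it can find the utility. This is only a sketch; have_command is a hypothetical helper name, and it checks only that the command is on the current user's PATH, not that the kernel modules are loaded.

```shell
#!/bin/sh
# Hypothetical check: confirm a command can be found on the current user's PATH.
have_command() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found on PATH"
        return 1
    fi
}
```

Log in as the JFORCES user and run "have_command cr_checkpoint" to confirm the utility is reachable.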