Information on JFORCES Checkpoint / Hotstart Capability
Information - This is
important! Checkpoint/Hotstart Capability is NOT
installed by default, so if you want to use this capability you MUST
manually install it!
What is the Checkpoint / Hotstart Capability?
The JFORCES Checkpoint / Hotstart capability lets the user capture a snapshot image of a running simulation (a.k.a. a checkpoint) that can later be restarted from the saved point. Restarting from this mid-run condition is called hotstarting. It lets the user resume the scenario from the same point without waiting for the simulation to execute up to it, both saving time and avoiding rerun errors. Typical uses for this functionality include:
Disaster Recovery – When running a live exercise, several checkpoints can be saved so that if the simulation subsequently stops for any reason it can be restarted from an interim point instead of being re-executed from the beginning up to the mid-scenario point.
After Action Review with Alternative Analysis – When an exercise umpire sees a critical decision point, the umpire can take a checkpoint so the scenario can later be restarted from that point and the value of alternative courses of action can be explored.
Debugging – If an error or inexplicable event occurs in a scenario at a predictable time, a checkpoint can be made before that time so the programmer can quickly get to the appropriate point for debugging. Note that the restarted scenario can be reviewed with the gdb debugger, but any viewing limitations that existed at the time of the checkpoint will remain (recompiling the simulation after the checkpoint will not change the hotstart execution).
The current method of invoking checkpointing is based upon the
Berkeley Lab Checkpoint/Restart (BLCR) capability. Since this method
changes the kernel, it is NOT automatically installed during JFORCES
installation. Each user is required to determine (in conjunction
with appropriate IT and/or security personnel) whether it's acceptable to
load BLCR on the system. All other JFORCES elements will work
without checkpointing. For more information about BLCR go to
Once installed on a JFORCES system, BLCR is not limited to checkpointing the simulation; it can be used to checkpoint most processes. This description focuses on checkpointing the simulation, since that is the typical use.
What are the Limitations?
The current implementation of checkpointing is based on the BLCR
functionality, which modifies the Linux kernel to
incorporate checkpoint capabilities. This means that this
implementation will not work on non-Linux machines (SUN, SGI or MS
Windows). What's more, the code is specific to the Linux kernel version.
Currently, checkpointing is only known to work on the Fedora
Core 2 and 4 releases. In addition, BLCR is very sensitive
to open channels, including TCP/IP and UDP/IP connections and
files. Today, JFORCES handles all of the following within
Programming work will be required to close and reopen any other
channels at checkpoint and restart. Programmers should put
changes in the Do_Checkpoint function of sim/exec/checkpoint.c.
Initial testing indicates that the checkpoint file can only be restarted on the same computer that the checkpoint was created on.
The hotstart will resume in exactly the same way that it was running when checkpointed, with the following exceptions:
It is paused immediately after startup to permit the MMI interfaces to hook up.
The original communications servers are shut down and reopened. This means that any external interfaces, including MMIs and live feed sources, will need to be restarted and reattached to the simulation.
The data collection file is restored to the state it was in when the checkpoint occurred.
The restarting execution uses the same communications ports that were initially used; using alternate communications ports is not an option.
Since the checkpoint is an exact image of the simulation at a
given time, if the scenario is already running towards an
unacceptable situation (an error or scenario faux pas) the hotstart
will also run to the same conclusion unless the result can be
modified by either the user or a programmer (e.g. through dbx insertion) between
the time of the checkpoint and the undesired event.
How is it Invoked?
There are five answers to this, depending on whether you want to use the checkpoint capability via the GUI, manually, or at a prescripted point, and on whether you're checkpointing or restarting the simulation. This 3x2 matrix yields five viable options because restarting in a prescripted mode is not an option.
Runtime GUI Interfaces
Checkpointing – This is invoked from the MMI menu during a simulation execution by selecting “System Controls->Checkpoint”. No options are provided. The status of the checkpoint is displayed in the light blue info window on the main controls. I recognize that the messages might be too large for that window, but if you think you need to see the full text it's available for review by selecting “System Controls->Review Info Log”.
Note that there's often a significant (10+ seconds) simulation execution delay during a checkpoint. The checkpoint file will be stored in the $HOTSTART_DIR directory (usually /data/hotstarts) under a directory named according to your scenario (e.g. Fort_Hood) with a name based on the unix time the scenario was saved. You're permitted to make multiple checkpoints during the same run.
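As a rough illustration of the storage layout just described, the following shell sketch lists the checkpoints saved for one scenario, oldest first. The function name list_checkpoints is hypothetical, and the sketch assumes each checkpoint is a single file named by its unix time and stored directly under $HOTSTART_DIR/<scenario>; check this against your own installation before relying on it.

```shell
#!/bin/sh
# Sketch only: list the checkpoints saved for one scenario, oldest first.
# Assumption: each checkpoint is a file named by unix time, stored directly
# under $HOTSTART_DIR/<scenario> (HOTSTART_DIR defaults to /data/hotstarts).
list_checkpoints() {
    scenario=$1
    dir="${HOTSTART_DIR:-/data/hotstarts}/$scenario"
    if [ ! -d "$dir" ]; then
        echo "no checkpoint directory: $dir" >&2
        return 1
    fi
    # Numeric sort puts the earliest unix time first.
    ls -1 "$dir" | sort -n
}
```

For the example scenario named above, this would be called as: list_checkpoints Fort_Hood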
Restarting – To restart from one of the checkpoints found under the HOTSTART_DIR
directory (usually /data/hotstarts), start the JFORCES interface and
select “System Controls->Start Scenario”. Then select
“Run Configuration” and click the button labeled “Change
to Hotstart Mode” at the bottom of the displayed list. Then
select the appropriate hotstart file from the displayed list. Note
that when you do, many of the original options on the “Start
Scenario” widget will disappear. This is because they're no
longer options but are instead predetermined by the checkpoint
itself. All hotstarts begin in interactive mode. Just click
Execute to start the hotstart.
Checkpointing – To checkpoint manually, find the process ID of the process to be checkpointed and then issue the following command:
cr_checkpoint -f (checkpoint destination filename) (process id to be checkpointed)
Note that you're not restricted to checkpointing only the simulation in manual mode; many other processes can be checkpointed, though those using comm channels (e.g. the forces interface) or file output cannot be restarted reliably.
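The manual checkpoint command above can be wrapped in a small shell function, sketched here under stated assumptions. The name manual_checkpoint and the default destination filename are hypothetical and not part of JFORCES or BLCR; only the cr_checkpoint -f form comes from the text above.

```shell
#!/bin/sh
# Sketch only: checkpoint one process by PID using the cr_checkpoint
# command line given above. manual_checkpoint is a hypothetical wrapper.
manual_checkpoint() {
    pid=$1
    # The default destination name is an assumption, not a JFORCES convention.
    dest=${2:-./checkpoint.$pid.$(date +%s)}
    # kill -0 sends no signal; it only tests that we can signal the process.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "cannot signal process $pid (not running, or not ours)" >&2
        return 1
    fi
    cr_checkpoint -f "$dest" "$pid" && echo "checkpoint written to $dest"
}
```

A call might look like manual_checkpoint 12345 /tmp/sim.ckpt, where both the process ID and the path are placeholders. The process ID itself can be found with a standard tool such as ps.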
Restarting – To restart manually from a checkpoint created manually, just type the checkpoint filename in a command window. If you're restarting the simulation you'll also need to put the data collection file (if there is one) in its original location, and you must create an empty file named /tmp/.resume_from_checkpoint. On non-simulation processes you're on your own restoring file output channels and interprocess communications channels.
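The preparation steps for a manual simulation restart could be scripted roughly as follows. This is a sketch, not a JFORCES utility: prepare_hotstart and its arguments are hypothetical, and only the marker file name /tmp/.resume_from_checkpoint is taken from the text above.

```shell
#!/bin/sh
# Sketch only: stage the files a manual simulation restart expects.
# prepare_hotstart is a hypothetical helper; the marker path is from the text.
prepare_hotstart() {
    saved_data=$1      # saved copy of the data collection file ("" if none)
    original_path=$2   # where the simulation originally wrote that file
    marker=${3:-/tmp/.resume_from_checkpoint}
    if [ -n "$saved_data" ]; then
        # Put the data collection file back in its original location.
        cp -p "$saved_data" "$original_path" || return 1
    fi
    # Create (or truncate to) an empty marker file.
    : > "$marker"
}
```

After a call such as prepare_hotstart /backup/run.dat /data/collect/run.dat (both paths are placeholders), the checkpoint file would then be started by typing its filename, as described above.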
You can prescript a checkpoint to occur during a run via the standard JFORCES interface. Go to “System Controls->DBA->Plan Run->Run Configuration->Maintain Run Configuration” and select the run you wish to checkpoint. When the “Editing Plan Run” widget appears, select “Replay Configuration Name” and then select “Replay Current Choice” from the presented list. Select “Preselect Time for Checkpoint” from the menu and specify the time.
Special Notes for the Prescripted Interface to Checkpointing:
You might want to create a new replay configuration via the Run Configuration interface instead of editing the current replay configuration if your standard replay configuration is used by others (e.g. readdb).
Remember that after specifying that a checkpoint should be saved, it will be saved EACH TIME you subsequently execute the run. This can result in a lot of files being saved over time. Typically you'll want to set up the run to save the execution once and then go back into the run configuration and turn this option off. If you find you've saved a lot of checkpoints you don't want, you can delete them via the “System Controls->DBA->Checkpoint Maintenance” interface.
Other Maintenance Information
Over time you'll want to delete old checkpoints. These can be removed via the “System Controls->DBA->Checkpoint Maintenance” interface.
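To help decide what to clean up, a short shell sketch like the one below can report checkpoint files older than a chosen age. The function name old_checkpoints and the 30-day default are hypothetical. The sketch only lists files and deletes nothing, leaving actual removal to the Checkpoint Maintenance interface.

```shell
#!/bin/sh
# Sketch only: report checkpoint files older than a given number of days.
# It lists candidates and deletes nothing; removal is left to the
# Checkpoint Maintenance interface described above.
old_checkpoints() {
    days=${1:-30}   # age threshold in days; 30 is an arbitrary default
    root=${HOTSTART_DIR:-/data/hotstarts}
    # -mtime +N matches files last modified more than N whole days ago.
    find "$root" -type f -mtime +"$days" -print
}
```

For example, old_checkpoints 90 would print every file under the hotstart directory that has not been modified in more than 90 days.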
Information on Storage Structure
When saved in manual mode a checkpoint is simply a binary file containing enough information (excepting file handles and comm sockets) to restart from a given point. But the files stored from a JFORCES checkpoint have a bit more information. These files are actually a compressed tar of three files, namely:
The checkpoint file itself
The data collection file as it existed at the time of checkpointing (so data collection can be reliably resumed from the same point during hotstarts)
Any replay-related files. This includes files being saved
from the checkpointed execution and input replay files to the
checkpointed application. Note that both will be restored so it's
possible to save and reuse replay files from checkpointed executions.
These compressed tars are saved in the $HOTSTART_DIR directory (as set up by your login script, but typically /data/hotstarts). They are saved in subdirectories named after the scenario, in files named according to the unix time the checkpoint was made.
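To see what a particular JFORCES checkpoint contains without unpacking it, something like the following shell sketch may help. The function name inspect_checkpoint is hypothetical, and the sketch assumes the "compressed tar" is gzip-compressed, which is not confirmed here; adjust the tar flags if your files use a different compressor.

```shell
#!/bin/sh
# Sketch only: list the members of a JFORCES checkpoint archive without
# extracting it. Assumes the "compressed tar" is gzip-compressed; if your
# files use another compressor, change the z flag accordingly.
inspect_checkpoint() {
    archive=$1
    if [ ! -f "$archive" ]; then
        echo "no such checkpoint file: $archive" >&2
        return 1
    fi
    # -t lists members, -z filters through gzip, -f names the archive.
    tar -tzf "$archive"
}
```

The listing should show the checkpoint file, the data collection file, and any replay-related files described above.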
Before describing installation, I should mention that
checkpoint/hotstart capabilities are NOT installed automatically.
This is because it makes non-standard (although simple) changes to the
Linux kernel, which caused some security officers concern. So
you're required to install it yourself in the hope you first clear it
with security and IT representatives.
The good news is that this functionality is easy to install.
To install, get the Berkeley Lab Checkpoint/Restart
application. Today you can find this at
Version 0.4.2 is the version tested with Fedora Core 2 and
4. The standard install works through compilation.
Now that it's installed you need to load the modules. Today
this is done by creating an executable file named
/etc/rc.d/rc5.d/S97blcr with the following lines (with version and
library names changed as appropriate to your system):
Save this file, make sure the permissions are set to permit
execution, and test by typing the following:
Fix any problems (again, look at the BLCR website for any
installation changes). Ignore any "file exists" errors;
these just mean the modules have already been loaded into the kernel.
If there are no obvious problems verify the modules have been loaded into the kernel by issuing the following command:
You should see something like:
You'll also want to make sure the JFORCES user can access the
utilities. A simple test for this is to log in as the JFORCES
user and type:
cr_checkpoint (carriage return)
If you get a usage statement you know that the JFORCES user can use
the checkpoint/restart capability.
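As a complement to the interactive test above, a login or startup script could check that the utility is on the JFORCES user's PATH. The sketch below uses hypothetical function names (have_tool, check_blcr_tools) and only confirms that the command can be found; the usage-statement test above remains the real confirmation that it runs.

```shell
#!/bin/sh
# Sketch only: confirm that a command is on the current user's PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report on cr_checkpoint, the utility named in the text above.
check_blcr_tools() {
    if have_tool cr_checkpoint; then
        echo "cr_checkpoint found"
    else
        echo "cr_checkpoint not found on PATH" >&2
        return 1
    fi
}
```

Run check_blcr_tools while logged in as the JFORCES user; a non-zero return suggests the PATH or the installation needs attention.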