Skip to content

Conversation

irenedea
Copy link
Collaborator

@irenedea irenedea commented Aug 8, 2025

  • creates a global mlflow run that composer will reuse (We can use this global run to log other metrics outside of the training run 😄 )
  • saves checkpoints to mlflow, so we can autoresume on non-interactive runs

After this PR, we should load the experience buffer from the checkpoints in order for checkpoints to work correctly with async. (Shouldn't be too hard..) It only works for sync right now.

https://dbc-559ffd80-2bfc.cloud.databricks.com/ml/experiments/723944411900647/runs/fcbceb3f3c9142539744a0883575ab0a/system-metrics?o=7395834863327820

image

You can see the metrics /system metrics for two iterations, where the second was a resumption. This was a super small dummy run, so the loss values seem to not show up when they repeat at 0.0... 🤷‍♀️

irenedea and others added 3 commits August 11, 2025 21:41
works

comment

rebased and updated

supporting absolute paths and dbfs

adding artifacts

moved to mlflow utils file

minor fix

slight adjustment
@rithwik-db rithwik-db force-pushed the irene/checkpoints-mlflow branch from 8880bc0 to cb5fc0b Compare August 11, 2025 21:42
Comment on lines +86 to +98
# NOTE: This doesn't work yet for a few reasons:
# 1. Downloading nested mlflow artifacts doesn't work correctly due to the MlflowObjectStore
# having issues. For instance, https://github.com/mosaicml/composer/blob/4ae29b1afec56ce2d54f6fa07a7f9578a0d364b0/composer/utils/object_store/mlflow_object_store.py#L465-L476
# requires `tmp_path = os.path.join(tmp_dir, os.path.basename(artifact_path))` instead of what it currently
# does. By doing that, the symlink can be loaded correctly.
# 2. If save_folder is an absolute path (e.g. /tmp/checkpoints), the symlink will be created using this
# absolute path. This is not a valid symlink in mlflow so we need to do some os.path gymnastics to
# support absolute paths for save_folder.
# 3. We also need to support save_folder being a dbfs path eventually.
# Proposed Approach
# - Create an MlflowCheckpointActor (allowing us to set WORLD_SIZE=1)
# and create functions within that are based on MlflowObjectStore.
# that safely handle dbfs paths and absolute paths.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is NOT ready for review since there's a lot of os.path gymnastics that we are doing for supporting saving things to mlflow artifacts. I am going to keep this PR on hold for now until we have time to think of a more resilient solution that addresses the problems here. (cc: @irenedea @bowenyang008)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants