Create global mlflow run and use it for checkpoints #144
base: single-controller-hackathon
works comment rebased and updated supporting absolute paths and dbfs adding artifacts moved to mlflow utils file minor fix slight adjustment
# NOTE: This doesn't work yet for a few reasons:
# 1. Downloading nested mlflow artifacts doesn't work correctly because of issues in the
#    MlflowObjectStore. For instance, https://github.com/mosaicml/composer/blob/4ae29b1afec56ce2d54f6fa07a7f9578a0d364b0/composer/utils/object_store/mlflow_object_store.py#L465-L476
#    needs `tmp_path = os.path.join(tmp_dir, os.path.basename(artifact_path))` instead of what it
#    currently does. With that change, the symlink can be loaded correctly.
# 2. If save_folder is an absolute path (e.g. /tmp/checkpoints), the symlink will be created using this
#    absolute path. This is not a valid symlink in mlflow, so we need to do some os.path gymnastics to
#    support absolute paths for save_folder.
# 3. We also need to support save_folder being a dbfs path eventually.
# Proposed Approach
# - Create an MlflowCheckpointActor (allowing us to set WORLD_SIZE=1)
#   and create functions within it, based on MlflowObjectStore,
#   that safely handle dbfs paths and absolute paths.
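For discussion, here is a minimal sketch of the path handling that points 1 to 3 seem to call for. It is not the code in this PR or in composer: `to_artifact_path` and `local_download_path` are hypothetical helper names, and stripping the filesystem root or the `dbfs:` scheme to get a relative artifact path is an assumption about what mlflow would accept. Only the `os.path.basename` expression comes from the note above.

```python
import os
import posixpath


def to_artifact_path(save_folder: str, filename: str) -> str:
    """Build a relative, POSIX-style artifact path from save_folder and filename.

    Hypothetical helper: assumes a relative path is what the artifact store
    needs, so the `dbfs:` scheme and any leading slashes are dropped.
    """
    # Drop a `dbfs:` scheme prefix if present (assumed form: dbfs:/some/dir).
    if save_folder.startswith("dbfs:"):
        save_folder = save_folder[len("dbfs:"):]
    # Use forward slashes regardless of the local OS separator.
    folder = save_folder.replace(os.sep, "/")
    # Collapse redundant separators and trailing slashes, then make it relative.
    folder = posixpath.normpath(folder).lstrip("/")
    if folder in ("", "."):
        return filename
    return posixpath.join(folder, filename)


def local_download_path(tmp_dir: str, artifact_path: str) -> str:
    """Local target for a downloaded artifact, as suggested in point 1.

    Only the basename is kept, so a nested artifact lands directly in tmp_dir.
    """
    return os.path.join(tmp_dir, os.path.basename(artifact_path))


if __name__ == "__main__":
    print(to_artifact_path("/tmp/checkpoints", "latest.symlink"))
    print(to_artifact_path("dbfs:/mnt/ckpt/", "latest.symlink"))
    print(local_download_path("scratch", "tmp/checkpoints/latest.symlink"))
```

Keeping both helpers as pure string functions would make them easy to unit test before wiring them into something like the proposed MlflowCheckpointActor.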
This PR is NOT ready for review, since it relies on a lot of os.path gymnastics to support saving things to mlflow artifacts. I am going to keep this PR on hold for now until we have time to think of a more resilient solution that addresses the problems here. (cc: @irenedea @bowenyang008)
After this PR, we should load the experience buffer from the checkpoints so that checkpointing works correctly with async (this shouldn't be too hard). It only works for sync right now.
https://dbc-559ffd80-2bfc.cloud.databricks.com/ml/experiments/723944411900647/runs/fcbceb3f3c9142539744a0883575ab0a/system-metrics?o=7395834863327820
You can see the metrics / system metrics for two iterations, where the second was a resumption. This was a super small dummy run, so the loss values don't seem to show up when they repeat at 0.0... 🤷‍♀️