From 2eaec88602ff6a02f202077d1d0c8c626538a21c Mon Sep 17 00:00:00 2001
From: Kevin Tyle
Date: Mon, 21 Jul 2025 21:08:40 +0000
Subject: [PATCH 1/6] Update data access section

---
 portal/cookbook-guide.md | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/portal/cookbook-guide.md b/portal/cookbook-guide.md
index 86422c88d..d2db628b8 100644
--- a/portal/cookbook-guide.md
+++ b/portal/cookbook-guide.md
@@ -24,14 +24,28 @@ Using the Pythia Cookbook template to create reproducible documents housed elsew

 If you're not looking to create a _new_ Cookbook, but rather looking for guidance on contributing to _existing_ Cookbooks, first make sure you're comfortable with the [GitHub forking workflow](https://foundations.projectpythia.org/foundations/github/github-workflows.html#forking-workflow), then take a look at the section below on "Pull Requests and previews".

-## A. Data Access
+## A. Data access and storage considerations for Cookbooks

-Before developing your cookbook, you should consider how it will access the data you plan to use. In loose order of preference, we recommend the following:
+Pythia Cookbooks are typically powered by one or more geoscientific data sets to help illustrate a workflow or concept. The variety of formats and sources of Earth science data is huge. Here we provide general guidelines for helping choose data for your Cookbook, as well as options for storing and making data accessible if necessary.

-1. Rely on data that is already freely available and accessible with tools in the ecosystem. Point to Foundations or other cookbooks for tool how-to guides if needed. Examples include the [CMIP6 Cookbook](https://projectpythia.org/cmip6-cookbook/) and the [CESM LENS on AWS Cookbook](https://projectpythia.org/cesm-lens-aws-cookbook/)
-1. Focus on representative subsets of data that can be packaged alongside the cookbook in-repo. An example is the [Landsat ML Cookbook](https://projectpythia.org/landsat-ml-cookbook/README.html)
-1. Discuss your larger data storage needs with the Pythia team. We are currently experimenting with cloud object storage for Cookbooks via NSF JetStream2.
-1. Provide the tools and/or clear documentation for accessing the data that you have stored somewhere else
+### Options for data pathways
+
+Cookbooks can most often succeed by relying on data that are publicly accessible, small, or otherwise self-contained. In order of preference, we recommend the following strategies for managing Cookbook data:
+
+1. **Remotely access open data**
For most Cookbooks that rely on data to demonstrate their concepts, we recommend accessing open, public datasets remotely in a sustainable way. Use tools like [Xarray](https://xarray.dev), [Siphon](https://www.unidata.ucar.edu/software/siphon), and [Intake](https://intake.readthedocs.io) to read data from providers such as [NOAA NCEI](https://www.ncei.noaa.gov), [AWS Open Data](https://registry.opendata.aws), [Google Cloud Public Datasets](https://cloud.google.com/datasets) and [Source Cooperative](https://source.coop), as long as such data are licensed and priced openly for public demonstration and use. Examples of existing Cookbooks that follow this preferred method include the [CMIP6 Cookbook](https://projectpythia.org/cmip6-cookbook/) and the [CESM LENS on AWS Cookbook](https://projectpythia.org/cesm-lens-aws-cookbook/).
+
+2. **Commit a small data artifact to your Cookbook repository**
+If a few data files whose total size amount to less than 50MB can power your Cookbook, these can be directly stored in your `git` repository! *Make sure you have the license to provide such datasets*. An example Cookbook is the [Landsat ML Cookbook](https://projectpythia.org/landsat-ml-cookbook/README.html). *Note that the more files you commit and the larger they are, the more sluggish your Cookbook's notebooks will quickly become*. Exercise restraint!
+
+3. **Generate “toy” sample data in your Cookbook**
+For many concepts, we encourage writing self-contained functions to generate simple representative datasets for demonstrating scientific concepts. Your Cookbook can even reuse these sample data repeatedly throughout.
+
+4. **For complex Cookbooks that rely on large datasets that are not already accessible through other services**, we suggest two options:
+a. Institutional Repositories
+Many universities, labs, and centers offer institutional repositories for storing data in a manner that makes it freely and readily available to the public. If you’re based at a university or a publicly funded research facility, check with your local library or data management office. If you are funded by NSF, you may be able to store your data on NSF NCAR’s [Research Data Archive (RDA)](https://rda.ucar.edu).
+b. Project Pythia's [NSF Jetstream2](https://jetstream-cloud.org) Object Store
+If you have created a larger dataset for your Cookbook and don’t have access to institutional resources of your own, Project Pythia may be able to provide a home on our
+cloud object store. Our [Ocean Biogeochemistry Cookbook](https://projectpythia.org/ocean-bgc-cookbook/notebooks/readintutorial) uses this option. Please [contact the Project Pythia team](https://discourse.pangeo.io/c/education/project-pythia/60) if you would like to explore this option.

 ## B. Create a Repository From the Cookbook Template

From feba08f20d0cd396a15fbb3940f232ddf3dabd24 Mon Sep 17 00:00:00 2001
From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Date: Mon, 21 Jul 2025 21:11:41 +0000
Subject: [PATCH 2/6] [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci
---
 portal/cookbook-guide.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/portal/cookbook-guide.md b/portal/cookbook-guide.md
index d2db628b8..b72d1baec 100644
--- a/portal/cookbook-guide.md
+++ b/portal/cookbook-guide.md
@@ -26,7 +26,7 @@ If you're not looking to create a _new_ Cookbook, but rather looking for guidanc

 ## A. Data access and storage considerations for Cookbooks

-Pythia Cookbooks are typically powered by one or more geoscientific data sets to help illustrate a workflow or concept. The variety of formats and sources of Earth science data is huge. Here we provide general guidelines for helping choose data for your Cookbook, as well as options for storing and making data accessible if necessary.
+Pythia Cookbooks are typically powered by one or more geoscientific data sets to help illustrate a workflow or concept. The variety of formats and sources of Earth science data is huge. Here we provide general guidelines for helping choose data for your Cookbook, as well as options for storing and making data accessible if necessary.

 ### Options for data pathways

@@ -34,16 +34,16 @@ Cookbooks can most often succeed by relying on data that are publicly accessible

 1. **Remotely access open data**
For most Cookbooks that rely on data to demonstrate their concepts, we recommend accessing open, public datasets remotely in a sustainable way. Use tools like [Xarray](https://xarray.dev), [Siphon](https://www.unidata.ucar.edu/software/siphon), and [Intake](https://intake.readthedocs.io) to read data from providers such as [NOAA NCEI](https://www.ncei.noaa.gov), [AWS Open Data](https://registry.opendata.aws), [Google Cloud Public Datasets](https://cloud.google.com/datasets) and [Source Cooperative](https://source.coop), as long as such data are licensed and priced openly for public demonstration and use. Examples of existing Cookbooks that follow this preferred method include the [CMIP6 Cookbook](https://projectpythia.org/cmip6-cookbook/) and the [CESM LENS on AWS Cookbook](https://projectpythia.org/cesm-lens-aws-cookbook/).

-2. **Commit a small data artifact to your Cookbook repository**
+2. **Commit a small data artifact to your Cookbook repository**
 If a few data files whose total size amount to less than 50MB can power your Cookbook, these can be directly stored in your `git` repository! *Make sure you have the license to provide such datasets*. An example Cookbook is the [Landsat ML Cookbook](https://projectpythia.org/landsat-ml-cookbook/README.html). *Note that the more files you commit and the larger they are, the more sluggish your Cookbook's notebooks will quickly become*. Exercise restraint!

-3. **Generate “toy” sample data in your Cookbook**
+3. **Generate “toy” sample data in your Cookbook**
 For many concepts, we encourage writing self-contained functions to generate simple representative datasets for demonstrating scientific concepts. Your Cookbook can even reuse these sample data repeatedly throughout.

-4. **For complex Cookbooks that rely on large datasets that are not already accessible through other services**, we suggest two options:
-a. Institutional Repositories
-Many universities, labs, and centers offer institutional repositories for storing data in a manner that makes it freely and readily available to the public. If you’re based at a university or a publicly funded research facility, check with your local library or data management office. If you are funded by NSF, you may be able to store your data on NSF NCAR’s [Research Data Archive (RDA)](https://rda.ucar.edu).
-b. Project Pythia's [NSF Jetstream2](https://jetstream-cloud.org) Object Store
+4. **For complex Cookbooks that rely on large datasets that are not already accessible through other services**, we suggest two options:
+a. Institutional Repositories
+Many universities, labs, and centers offer institutional repositories for storing data in a manner that makes it freely and readily available to the public. If you’re based at a university or a publicly funded research facility, check with your local library or data management office. If you are funded by NSF, you may be able to store your data on NSF NCAR’s [Research Data Archive (RDA)](https://rda.ucar.edu).
+b. Project Pythia's [NSF Jetstream2](https://jetstream-cloud.org) Object Store
 If you have created a larger dataset for your Cookbook and don’t have access to institutional resources of your own, Project Pythia may be able to provide a home on our
 cloud object store. Our [Ocean Biogeochemistry Cookbook](https://projectpythia.org/ocean-bgc-cookbook/notebooks/readintutorial) uses this option. Please [contact the Project Pythia team](https://discourse.pangeo.io/c/education/project-pythia/60) if you would like to explore this option.

From 28539db174a4a50102a7220e0cf4424513319632 Mon Sep 17 00:00:00 2001
From: Kevin Tyle
Date: Tue, 22 Jul 2025 09:10:06 -0400
Subject: [PATCH 3/6] Update portal/cookbook-guide.md

BR suggested change 1

Co-authored-by: Brian Rose
---
 portal/cookbook-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/portal/cookbook-guide.md b/portal/cookbook-guide.md
index b72d1baec..0bcf8e564 100644
--- a/portal/cookbook-guide.md
+++ b/portal/cookbook-guide.md
@@ -35,7 +35,7 @@ Cookbooks can most often succeed by relying on data that are publicly accessible
 1. **Remotely access open data**
For most Cookbooks that rely on data to demonstrate their concepts, we recommend accessing open, public datasets remotely in a sustainable way. Use tools like [Xarray](https://xarray.dev), [Siphon](https://www.unidata.ucar.edu/software/siphon), and [Intake](https://intake.readthedocs.io) to read data from providers such as [NOAA NCEI](https://www.ncei.noaa.gov), [AWS Open Data](https://registry.opendata.aws), [Google Cloud Public Datasets](https://cloud.google.com/datasets) and [Source Cooperative](https://source.coop), as long as such data are licensed and priced openly for public demonstration and use. Examples of existing Cookbooks that follow this preferred method include the [CMIP6 Cookbook](https://projectpythia.org/cmip6-cookbook/) and the [CESM LENS on AWS Cookbook](https://projectpythia.org/cesm-lens-aws-cookbook/).

 2. **Commit a small data artifact to your Cookbook repository**
-If a few data files whose total size amount to less than 50MB can power your Cookbook, these can be directly stored in your `git` repository! *Make sure you have the license to provide such datasets*. An example Cookbook is the [Landsat ML Cookbook](https://projectpythia.org/landsat-ml-cookbook/README.html). *Note that the more files you commit and the larger they are, the more sluggish your Cookbook's notebooks will quickly become*. Exercise restraint!
+If a few data files whose total size amount to less than 50MB can power your Cookbook, these can be directly stored in your `git` repository! *Make sure you have the license to provide such datasets*. An example Cookbook is the [Landsat ML Cookbook](https://projectpythia.org/landsat-ml-cookbook/). *Note that the more files you commit and the larger they are, the more sluggish your Cookbook's notebooks will quickly become*. Exercise restraint!

 3. **Generate “toy” sample data in your Cookbook**
 For many concepts, we encourage writing self-contained functions to generate simple representative datasets for demonstrating scientific concepts. Your Cookbook can even reuse these sample data repeatedly throughout.

From 4445d45405e0a28beb1c42aa234715bf6140d48e Mon Sep 17 00:00:00 2001
From: Kevin Tyle
Date: Tue, 22 Jul 2025 09:10:42 -0400
Subject: [PATCH 4/6] Update portal/cookbook-guide.md

BR suggested change 2

Co-authored-by: Brian Rose
---
 portal/cookbook-guide.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/portal/cookbook-guide.md b/portal/cookbook-guide.md
index 0bcf8e564..12d62286f 100644
--- a/portal/cookbook-guide.md
+++ b/portal/cookbook-guide.md
@@ -41,11 +41,11 @@ If a few data files whose total size amount to less than 50MB can power your Coo
 For many concepts, we encourage writing self-contained functions to generate simple representative datasets for demonstrating scientific concepts. Your Cookbook can even reuse these sample data repeatedly throughout.

 4. **For complex Cookbooks that rely on large datasets that are not already accessible through other services**, we suggest two options:
-a. Institutional Repositories
-Many universities, labs, and centers offer institutional repositories for storing data in a manner that makes it freely and readily available to the public. If you’re based at a university or a publicly funded research facility, check with your local library or data management office. If you are funded by NSF, you may be able to store your data on NSF NCAR’s [Research Data Archive (RDA)](https://rda.ucar.edu).
-b. Project Pythia's [NSF Jetstream2](https://jetstream-cloud.org) Object Store
-If you have created a larger dataset for your Cookbook and don’t have access to institutional resources of your own, Project Pythia may be able to provide a home on our
-cloud object store. Our [Ocean Biogeochemistry Cookbook](https://projectpythia.org/ocean-bgc-cookbook/notebooks/readintutorial) uses this option. Please [contact the Project Pythia team](https://discourse.pangeo.io/c/education/project-pythia/60) if you would like to explore this option.
+Institutional Repositories
+: Many universities, labs, and centers offer institutional repositories for storing data in a manner that makes it freely and readily available to the public. If you’re based at a university or a publicly funded research facility, check with your local library or data management office. If you are funded by NSF, you may be able to store your data on NSF NCAR’s [Research Data Archive (RDA)](https://rda.ucar.edu).
+
+Project Pythia's [NSF Jetstream2](https://jetstream-cloud.org) Object Store
+: If you have created a larger dataset for your Cookbook and don’t have access to institutional resources of your own, Project Pythia may be able to provide a home on our cloud object store. Our [Ocean Biogeochemistry Cookbook](https://projectpythia.org/ocean-bgc-cookbook/notebooks/readintutorial) uses this option. Please [contact the Project Pythia team](https://discourse.pangeo.io/c/education/project-pythia/60) if you would like to explore this option.

 ## B. Create a Repository From the Cookbook Template

From 8271d6a52c1fbe734ef997bebb0f5e27c137ef11 Mon Sep 17 00:00:00 2001
From: Kevin Tyle
Date: Tue, 22 Jul 2025 18:30:50 +0000
Subject: [PATCH 5/6] Fix formatting in Data Access/complex Cookbooks section

---
 portal/cookbook-guide.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/portal/cookbook-guide.md b/portal/cookbook-guide.md
index 12d62286f..25c79f198 100644
--- a/portal/cookbook-guide.md
+++ b/portal/cookbook-guide.md
@@ -41,11 +41,12 @@ If a few data files whose total size amount to less than 50MB can power your Coo
 For many concepts, we encourage writing self-contained functions to generate simple representative datasets for demonstrating scientific concepts. Your Cookbook can even reuse these sample data repeatedly throughout.

 4. **For complex Cookbooks that rely on large datasets that are not already accessible through other services**, we suggest two options:
-Institutional Repositories
-: Many universities, labs, and centers offer institutional repositories for storing data in a manner that makes it freely and readily available to the public. If you’re based at a university or a publicly funded research facility, check with your local library or data management office. If you are funded by NSF, you may be able to store your data on NSF NCAR’s [Research Data Archive (RDA)](https://rda.ucar.edu).
+
+ a. **Institutional Repositories**
+ Many universities, labs, and centers offer institutional repositories for storing data in a manner that makes it freely and readily available to the public. If you’re based at a university or a publicly funded research facility, check with your local library or data management office. If you are funded by NSF, you may be able to store your data on NSF NCAR’s [Research Data Archive (RDA)](https://rda.ucar.edu).

-Project Pythia's [NSF Jetstream2](https://jetstream-cloud.org) Object Store
-: If you have created a larger dataset for your Cookbook and don’t have access to institutional resources of your own, Project Pythia may be able to provide a home on our cloud object store. Our [Ocean Biogeochemistry Cookbook](https://projectpythia.org/ocean-bgc-cookbook/notebooks/readintutorial) uses this option. Please [contact the Project Pythia team](https://discourse.pangeo.io/c/education/project-pythia/60) if you would like to explore this option.
+ b. **Project Pythia's [NSF Jetstream2](https://jetstream-cloud.org) Object Store**
+ If you have created a larger dataset for your Cookbook and don’t have access to institutional resources of your own, Project Pythia may be able to provide a home on our cloud object store. Our [Ocean Biogeochemistry Cookbook](https://projectpythia.org/ocean-bgc-cookbook/notebooks/readintutorial) uses this option. Please [contact the Project Pythia team](https://discourse.pangeo.io/c/education/project-pythia/60) if you would like to explore this option.

 ## B. Create a Repository From the Cookbook Template

From aa192bbbae97115ab06bcbccffdee20ad54993a6 Mon Sep 17 00:00:00 2001
From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Date: Tue, 22 Jul 2025 18:31:16 +0000
Subject: [PATCH 6/6] [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci
---
 portal/cookbook-guide.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/portal/cookbook-guide.md b/portal/cookbook-guide.md
index 25c79f198..1f526aa5f 100644
--- a/portal/cookbook-guide.md
+++ b/portal/cookbook-guide.md
@@ -41,11 +41,11 @@ If a few data files whose total size amount to less than 50MB can power your Coo
 For many concepts, we encourage writing self-contained functions to generate simple representative datasets for demonstrating scientific concepts. Your Cookbook can even reuse these sample data repeatedly throughout.

 4. **For complex Cookbooks that rely on large datasets that are not already accessible through other services**, we suggest two options:
-
- a. **Institutional Repositories**
+
+ a. **Institutional Repositories**
  Many universities, labs, and centers offer institutional repositories for storing data in a manner that makes it freely and readily available to the public. If you’re based at a university or a publicly funded research facility, check with your local library or data management office. If you are funded by NSF, you may be able to store your data on NSF NCAR’s [Research Data Archive (RDA)](https://rda.ucar.edu).

- b. **Project Pythia's [NSF Jetstream2](https://jetstream-cloud.org) Object Store**
+ b. **Project Pythia's [NSF Jetstream2](https://jetstream-cloud.org) Object Store**
  If you have created a larger dataset for your Cookbook and don’t have access to institutional resources of your own, Project Pythia may be able to provide a home on our cloud object store. Our [Ocean Biogeochemistry Cookbook](https://projectpythia.org/ocean-bgc-cookbook/notebooks/readintutorial) uses this option. Please [contact the Project Pythia team](https://discourse.pangeo.io/c/education/project-pythia/60) if you would like to explore this option.

 ## B. Create a Repository From the Cookbook Template
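
Strategy 3 in the patched guide (generating “toy” sample data) could be sketched roughly as below. This is only an illustration under stated assumptions: the function name `make_toy_temperature_series`, its parameters, and all default values are hypothetical and come neither from the patch nor from any existing Cookbook. It uses only the Python standard library so that it stays self-contained.

```python
import math
import random


def make_toy_temperature_series(n_days=365, mean=15.0, amplitude=10.0,
                                noise_std=1.5, seed=0):
    """Return a list of (day_index, temperature) pairs with a seasonal cycle.

    Hypothetical helper: the name, parameters, and defaults are
    illustrative assumptions, not part of the guide.
    """
    rng = random.Random(seed)  # local generator keeps the data reproducible
    series = []
    for day in range(n_days):
        # One sine cycle per 365 days stands in for a seasonal cycle.
        seasonal = amplitude * math.sin(2.0 * math.pi * day / 365.0)
        # Gaussian noise mimics day-to-day variability.
        noise = rng.gauss(0.0, noise_std)
        series.append((day, mean + seasonal + noise))
    return series


# The same seed always yields the same data, so every notebook in a
# Cookbook can regenerate an identical sample instead of storing files.
toy = make_toy_temperature_series()
print(len(toy))
```

In a real Cookbook the generated values would more likely be wrapped in an Xarray object with named coordinates; that step is left out here to keep the sketch free of third-party dependencies.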