Optimize with dask #1981
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Changes from all commits (6 total, all by aysim319):
- aacc545: using dask for read/write large files
- dbde5c7: undo testing change and also using datetime instead of str for date p…
- 1394d3d: refactored reading into seperate function
- dfc3be2: organizing code
- e07c697: only procesing once and passing along the dataframe
- d1ee4ce: added/updated tests
Binary file removed (-6.26 MB, not shown): doctor_visits/delphi_doctor_visits/input/SYNEDI_AGG_OUTPATIENT_18052020_1455CDT.csv.gz
@@ -0,0 +1,101 @@
import dask.dataframe as dd
from datetime import datetime
import numpy as np
import pandas as pd
from pathlib import Path

from .config import Config


def write_to_csv(output_df: pd.DataFrame, geo_level: str, se: bool, out_name: str, logger, output_path="."):
    """Write sensor values to csv.

    Args:
        output_df: dataframe containing sensor rates, se, unique dates, and unique geo_id
        geo_level: geographic resolution, one of ["county", "state", "msa", "hrr", "nation", "hhs"]
        se: boolean to write out standard errors, if true, use an obfuscated name
        out_name: name of the output file
        logger: logger to write warnings and debug output to
        output_path: outfile path to write the csv (default is current directory)
    """
    if se:
        logger.info(f"========= WARNING: WRITING SEs TO {out_name} =========")

    out_n = 0
    for d in set(output_df["date"]):
        filename = "%s/%s_%s_%s.csv" % (output_path,
                                        (d + Config.DAY_SHIFT).strftime("%Y%m%d"),
                                        geo_level,
                                        out_name)
        single_date_df = output_df[output_df["date"] == d]
        with open(filename, "w") as outfile:
            outfile.write("geo_id,val,se,direction,sample_size\n")

            for line in single_date_df.itertuples():
                geo_id = line.geo_id
                sensor = 100 * line.val  # report percentages
                se_val = 100 * line.se
                assert not np.isnan(sensor), "sensor value is nan, check pipeline"
                assert sensor < 90, f"strangely high percentage {geo_id, sensor}"
                if not np.isnan(se_val):
                    assert se_val < 5, f"standard error suspiciously high! investigate {geo_id}"

                if se:
                    assert sensor > 0 and se_val > 0, "p=0, std_err=0 invalid"
                    outfile.write(
                        "%s,%f,%s,%s,%s\n" % (geo_id, sensor, se_val, "NA", "NA"))
                else:
                    # for privacy reasons we will not report the standard error
                    outfile.write(
                        "%s,%f,%s,%s,%s\n" % (geo_id, sensor, "NA", "NA", "NA"))
                out_n += 1
    logger.debug(f"wrote {out_n} rows for {geo_level}")


def csv_to_df(filepath: str, startdate: datetime, enddate: datetime, dropdate: datetime, logger) -> pd.DataFrame:
    """
    Read the csv with Dask, filter it to the training date range and the needed
    columns, then convert the result back into a pandas dataframe.

    Parameters
    ----------
    filepath: path to the aggregated doctor-visits data
    startdate: first sensor date (YYYY-mm-dd)
    enddate: last sensor date (YYYY-mm-dd)
    dropdate: data drop date (YYYY-mm-dd)
    logger: logger to write progress messages to

    Returns
    -------
    pd.DataFrame aggregated so it is unique by (service date, FIPS)
    """
    filepath = Path(filepath)
    logger.info(f"Processing {filepath}")

    ddata = dd.read_csv(
        filepath,
        compression="gzip",
        dtype=Config.DTYPES,
        blocksize=None,  # gzip is not splittable, so read the file as a single partition
    )

    ddata = ddata.dropna()
    # rename inconsistent column names to match config column names
    ddata = ddata.rename(columns=Config.DEVIANT_COLS_MAP)

    ddata = ddata[Config.FILT_COLS]
    ddata[Config.DATE_COL] = dd.to_datetime(ddata[Config.DATE_COL])

    # restrict to training start and end date
    startdate = startdate - Config.DAY_SHIFT

    assert startdate > Config.FIRST_DATA_DATE, "Start date <= first day of data"
    assert startdate < enddate, "Start date >= end date"
    assert enddate <= dropdate, "End date > drop date"

    date_filter = ((ddata[Config.DATE_COL] >= Config.FIRST_DATA_DATE) & (ddata[Config.DATE_COL] < dropdate))

    df = ddata[date_filter].compute()

    # aggregate age groups (so data is unique by service date and FIPS)
    df = df.groupby([Config.DATE_COL, Config.GEO_COL]).sum(numeric_only=True).reset_index()
    assert np.sum(df.duplicated()) == 0, "Duplicates after age group aggregation"
    assert (df[Config.COUNT_COLS] >= 0).all().all(), "Counts must be nonnegative"

    logger.info(f"Done processing {filepath}")
    return df
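The final step of `csv_to_df` collapses age-group rows so the frame is unique by (service date, FIPS). A minimal pandas sketch of that aggregation, using toy data and made-up column names (`ServiceDate`, `PatCountyFIPS`, `Denominator` stand in for the real `Config.DATE_COL`, `Config.GEO_COL`, and count columns):

```python
import pandas as pd

# Toy frame: two age-group rows share the same (date, FIPS) pair.
df = pd.DataFrame({
    "ServiceDate": pd.to_datetime(["2020-02-04", "2020-02-04", "2020-02-05"]),
    "PatCountyFIPS": ["01001", "01001", "01001"],
    "Denominator": [10, 5, 7],
})

# Collapse age groups: sum counts within each (date, FIPS) group.
agg = (df.groupby(["ServiceDate", "PatCountyFIPS"])
         .sum(numeric_only=True)
         .reset_index())
# Two rows remain, one per unique (date, FIPS); counts 15 and 7.
```

After this step the duplicate check (`df.duplicated()`) can never fire, since the groupby keys are unique by construction.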
@@ -11,6 +11,7 @@
     "pytest-cov",
     "pytest",
     "scikit-learn",
+    "dask",
 ]

 setup(
Binary file added (+779 KB, not shown): .../tests/comparison/process_data/main_after_date_SYNEDI_AGG_OUTPATIENT_07022020_1455CDT.pkl
Binary file added (+779 KB, not shown): doctor_visits/tests/test_data/SYNEDI_AGG_OUTPATIENT_07022020_1455CDT.pkl
@@ -0,0 +1,21 @@
"""Tests for process_data.py."""
from datetime import datetime
import logging
import pandas as pd

from delphi_doctor_visits.process_data import csv_to_df

TEST_LOGGER = logging.getLogger()


class TestProcessData:
    def test_csv_to_df(self):
        actual = csv_to_df(
            filepath="./test_data/SYNEDI_AGG_OUTPATIENT_07022020_1455CDT.csv.gz",
            startdate=datetime(2020, 2, 4),
            enddate=datetime(2020, 2, 5),
            dropdate=datetime(2020, 2, 6),
            logger=TEST_LOGGER,
        )

        comparison = pd.read_pickle("./comparison/process_data/main_after_date_SYNEDI_AGG_OUTPATIENT_07022020_1455CDT.pkl")
        pd.testing.assert_frame_equal(actual.reset_index(drop=True), comparison)
👍 we should start doing type specification.
yeah, but I think that should be a separate ticket maybe? don't want to add more PRs/confusion to an already scoped-out feature.