-
Notifications
You must be signed in to change notification settings - Fork 913
Description
I was trying to optimize the performance for a high-throughput use case (proposed by @janosh) where one might need to load/save large quantities of Structure
as JSON, and I'm currently starting with the saving phase of the workflow.
A quick line profiling reveals two significant bottlenecks:
as_dict
method ofStructure
: I did some preliminary profiling, and looks like this is owing to (unnecessarily) repeated calculation ofLattice
properties likelengths/angles
during serialization (and I believe proper caching could resolve this with correct cache voiding mechanism upon internal_matrix
change). But this would be a separate PR so won't cover too much here.- Topic of this thread: slow
json.dump
to file
Total time: 65.3412 s
File: show_structure_json_bottleneck.py
Function: save_to_file at line 38
Line # Hits Time Per Hit % Time Line Contents
==============================================================
38 @profile
39 def save_to_file(tmpdir):
40 1001 921.1 0.9 0.0 for i, struct in enumerate(STRUCTURES):
41 1000 32742869.8 32742.9 50.1 d = struct.as_dict()
42 1000 9134.1 9.1 0.0 path = os.path.join(tmpdir, f"struct_{i}.json")
43 2000 77023.5 38.5 0.1 with open(path, "w") as f:
44 1000 32511267.6 32511.3 49.8 json.dump(d, f)
And if we replace builtin json
with the faster drop-in replacement orjson
(I'm not aware of any caveat so far), we got (tested on WSL2 Ubuntu 22.04):
- almost 10x speed up converting Structure dict to string
- more than 10x speed up for saving Structure dict to file
- a little speed up loading JSON back
== Dump to String ==
json: 0.9821 s
orjson: 0.1173 s
== Save to File ==
json_plain: 3.2918 s, 105549.10 KB
orjson_plain: 0.2284 s, 97192.09 KB
json_gz: 8.0444 s, 32786.10 KB
orjson_gz: 3.3784 s, 32217.06 KB
== Load from File ==
json_plain: 3.1989 s
orjson_plain: 2.6992 s
json_gz: 3.4490 s
orjson_gz: 2.9699 s
Script: compare_structure_json.txt
As the speed up is quite obvious with almost no code change (also considering JSON is heavily used around the code base, so not only Structure
would benefit from this), perhaps we could consider adding orjson
as the default JSON handler? (I'm afraid adding it as an optional dependency would incur too much code change to detect whether orjson
is available)
But also I'm a bit unsure about the migration process as current json.dumps
takes kwargs
(e.g. #4295) and this might cause breakage:
pymatgen/src/pymatgen/core/structure.py
Line 2963 in 1e36ee5
json_str = json.dumps(self.as_dict(), **kwargs) |
monty
JSON and lack of custom JSON encoder support
orjson
is already an optional dependency for monty
so maybe it's also good for us to double check if orjson
is fully used from monty
's side as quite some JSON-related workloads are implicitly handled by monty
.
Also as orjson
doesn't directly support custom encoder/decoder (e.g. MontyEncoder
), some more thoughts might be needed to this end