Skip to content

orjson for much faster JSON operations? #4385

@DanielYang59

Description

@DanielYang59

I was trying to optimize the performance for a high-throughput use case (proposed by @janosh) where one might need to load/save large quantities of Structure as JSON, and I'm currently starting with the saving phase of the workflow.

A quick line profiling reveals two significant bottlenecks:

  • as_dict method of Structure: I did some preliminary profiling, and looks like this is owing to (unnecessarily) repeated calculation of Lattice properties like lengths/angles during serialization (and I believe proper caching could resolve this with correct cache voiding mechanism upon internal _matrix change). But this would be a separate PR so won't cover too much here.
  • Topic of this thread: slow json.dump to file
Total time: 65.3412 s
File: show_structure_json_bottleneck.py
Function: save_to_file at line 38

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    38                                           @profile
    39                                           def save_to_file(tmpdir):
    40      1001        921.1      0.9      0.0      for i, struct in enumerate(STRUCTURES):
    41      1000   32742869.8  32742.9     50.1          d = struct.as_dict()
    42      1000       9134.1      9.1      0.0          path = os.path.join(tmpdir, f"struct_{i}.json")
    43      2000      77023.5     38.5      0.1          with open(path, "w") as f:
    44      1000   32511267.6  32511.3     49.8              json.dump(d, f)

And if we replace builtin json with the faster drop-in replacement orjson (I'm not aware of any caveat so far), we got (tested on WSL2 Ubuntu 22.04):

  • almost 10x speed up converting Structure dict to string
  • more than 10x speed up for saving Structure dict to file
  • a little speed up loading JSON back
== Dump to String ==
    json: 0.9821 s
  orjson: 0.1173 s

== Save to File ==
  json_plain: 3.2918 s, 105549.10 KB
orjson_plain: 0.2284 s, 97192.09 KB
     json_gz: 8.0444 s, 32786.10 KB
   orjson_gz: 3.3784 s, 32217.06 KB

== Load from File ==
  json_plain: 3.1989 s
orjson_plain: 2.6992 s
     json_gz: 3.4490 s
   orjson_gz: 2.9699 s

Script: compare_structure_json.txt

As the speed up is quite obvious with almost no code change (also considering JSON is heavily used around the code base, so not only Structure would benefit from this), perhaps we could consider adding orjson as the default JSON handler? (I'm afraid adding it as an optional dependency would incur too much code change to detect whether orjson is available)

But also I'm a bit unsure about the migration process as current json.dumps takes kwargs (e.g. #4295) and this might cause breakage:

json_str = json.dumps(self.as_dict(), **kwargs)

monty JSON and lack of custom JSON encoder support

orjson is already an optional dependency for monty so maybe it's also good for us to double check if orjson is fully used from monty's side as quite some JSON-related workloads are implicitly handled by monty.

Also as orjson doesn't directly support custom encoder/decoder (e.g. MontyEncoder), some more thoughts might be needed to this end

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions