Skip to content

Conversation

jorisvandenbossche
Copy link
Member

Related to the discussion in #75

This is a very simple change (not yet trying to detect if mixed geometries are truly mixed or only single/multi of the same type): if there is more than 1 geometry type, we create the GDAL dataset with "Unknown" geometry type ("Unknown" here is the slightly confusing name for "any" geometry type: Use wkbUnknown if there are no constraints on the types geometry to be written. from https://gdal.org/api/raster_c_api.html#gdal_8h_1a7f01d3d8584f29e4ef3c56b5af49d816)

This is consistent with what Fiona (and therefore geopandas) currently does.
And I would argue that this is also what GDAL does to some extent (although it is of course not exactly equivalent situation). If you start from a source like GeoJSON which does not encode geometry type information and can store any mix of types (thus, very similar to a GeoDataFrame), GDAL (or ogrinfo) will infer it as "Unknown (any)", and for example when converting this GeoJSON file to a GPKG using ogr2ogr it will keep this "Unknown" geometry type.

Without this change, the tests that I added would fail (except for GeoJSON): for GPKG it "works" but sets a "Point" geometry type in the file, which is thus a "wrong" file. For FGB it actually errors.

@theroggy
Copy link
Member

theroggy commented Apr 23, 2022

Even though I admit the current behaviour is also a bit weird, for me personally this still rather feels like a step back than an improvement: without any warning/error you now lose the metadata in eg. a geopackage that a layer contains e.g. polygons.

So I'd rather go for the 2 lines of code to give the user the choice:

if output_geometrytype == 'GEOMETRY':
    geometry_type = "Unknown"

@theroggy
Copy link
Member

theroggy commented Apr 23, 2022

For the default behaviour of gdal, I suppose it will depend on the driver used to read the data or the way it is read. I have mostly experience with running SQL statements on geopackages, and for that case GDAL will pick the geometrytype of the first row fetched as geometrytype of the destination file... probably because it cannot know all the geometrytypes that will be fetched in the future.
I an ideal world it would keep track of all geometry types that were read/processed and streamed to the output file and then the geometry type of the column would be corrected in the destination file at the end. But I'm not sure if file formats typically support to alter the geometry type afterwards.

@theroggy
Copy link
Member

theroggy commented Apr 23, 2022

I also wondered how e.g. QGIS deals with a "geometry" column so I had a (very quick) look:

  • If you add a geopackage with a bit of a larger polygon layer with 4.8 million polygons/multipolygons where the layer type is "MultiPolygon" then the layer is instantly visible (if you are zoomed in somewhere).
  • If you add the same layer but where the layer has the type "Geometry", then QGIS has a delay of ~30 seconds - I suppose to scan the entire file to determine the types actually present in the file - and then it is shown as well. Afterwards it just treats the layer as a polygon layer for the symbology settings.

So, all in all QGIS at least tries to do the best job possible to handle the file and even though there is a delay, at least you can use it.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Apr 25, 2022

Note that this PR is meant as the baseline default behaviour (so not as a replacement for #75). We can still improve upon that for specific cases, such as detecting the case of single/multi geom of the same type and automatically doing something for that (cfr #75), or raising a warning if we use "Unknown" and let users overwrite it with an argument like geometry_type=, etc.

But to reiterate, the current behaviour on the main branch avoids writing data to some formats and is seemingly random in some cases (depending on the order of the rows). For example:

  • For GeoPackage, it will work (as it still does with this PR) but set a wrong geometry type in the file (and the actual value depending on the row order), creating an "invalid" file by default
  • For FlatGeobuf, it will always fail while the format actually does support mixed geometry types

For GeoJSON and Shapefile, nothing changes with this PR. GeoJSON simply doesn't care about geometry types, and Shapefile only supports a single type anyway (and for the case of single/multi of one type, passing "Unknown" is still fine as GDAL seems to infer the type in that case and writes a correct file).

I would argue that the change in this PR is therefore generally an improvement (compared to main).

I certainly agree that in the case of Polygon/MultiPolygon for GeoPackage, where it generally seems to be handled fine by readers like QGIS if the geometry type is not strictly correct (and even better handled compared to it set to "Geometry", according to your experiment). But again we can reiterate on this to find improvements for those specific cases on top of the general behaviour.

I have mostly experience with running SQL statements on geopackages, and for that case GDAL will pick the geometrytype of the first row fetched as geometrytype of the destination file.

Do you have an example of that? I tried ogr2ogr -where "col > 1" test_mixed2.gpkg test_mixed.gpkg where test_mixed.gpkg has mixed geometries (using "Unknown") and the where clause would ensure the resulting file has only "Polygon" geometries. But in that case the result still has the geometry type set to "Unknown".

Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pulling this out as an isolated change @jorisvandenbossche

Bringing this in line with what Fiona / GeoPandas already does is a good argument for handling "Uknown" this way, since we want pyogrio to be a nearly seamless drop-in for Fiona in GeoPandas.

I see #75 building on top of this for mixed plurality but not type. I'd like to see both in before we cut a release.

raising a warning if we use "Unknown" and let users overwrite it with an argument like geometry_type=

I like this option; emit a warning for default behavior but letting the user explicitly sidestep that by giving us specific instructions.

jorisvandenbossche and others added 2 commits April 25, 2022 18:37
Co-authored-by: Brendan Ward <bcward@astutespruce.com>
@jorisvandenbossche
Copy link
Member Author

raising a warning if we use "Unknown" and let users overwrite it with an argument like geometry_type=

I like this option; emit a warning for default behavior but letting the user explicitly sidestep that by giving us specific instructions.
...
Should we raise a warning here under an else branch?

I would prefer to defer to that another issue/PR, focusing here on the minimal change to align us with fiona/geopandas (and to follow the geopackage spec).
For example, I certainly think a warning can be useful in some cases, but I am personally also wary of raising it always. For example, why would we warn users about mixed geomtries when they are writing to GeoJSON, a format that simply doesn't care about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants