API reference
Core
Vectorized vector I/O using OGR.
- pyogrio.detect_write_driver(path)
Attempt to infer the driver for a path by extension or prefix.
Only drivers that support write capabilities will be detected.
If the path cannot be resolved to a single driver, a ValueError will be raised.
- Parameters:
- path: str
data source path
- Returns:
- str
name of the driver, if detected
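Example (illustrative): a minimal sketch of inferring the driver from a file extension; the path is hypothetical and only the extension matters here.
>>> from pyogrio import detect_write_driver
>>> detect_write_driver("data.gpkg")  # hypothetical path; ".gpkg" maps to the GeoPackage driver
'GPKG'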
- pyogrio.get_gdal_config_option(name)
Get the value for a GDAL configuration option.
- Parameters:
- name: str
name of the option to retrieve
- Returns:
- value of the option or None if not set
'ON' / 'OFF' are normalized to True / False.
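Example (illustrative): a minimal sketch of querying a configuration option; it assumes nothing has been set for this option earlier in the session (see set_gdal_config_options below).
>>> from pyogrio import get_gdal_config_option
>>> # returns None when the option was never set; 'ON'/'OFF' values come back as True/False
>>> get_gdal_config_option("CPL_DEBUG")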
- pyogrio.list_drivers(read=False, write=False)
List drivers available in GDAL.
- Parameters:
- read: bool, optional (default: False)
If True, will only return drivers that are known to support read capabilities.
- write: bool, optional (default: False)
If True, will only return drivers that are known to support write capabilities.
- Returns:
- dict
Mapping of driver name to file mode capabilities: "r": read, "w": write. Drivers that are available but with unknown support are marked with "?".
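Example (illustrative): filtering to write-capable drivers; the exact set returned depends on the local GDAL build.
>>> from pyogrio import list_drivers
>>> writable = list_drivers(write=True)  # mapping of driver name to capability string
>>> "GPKG" in writable                   # GeoPackage is typically available in GDAL builds
True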
- pyogrio.list_layers(path_or_buffer, /)
List layers available in an OGR data source.
NOTE: includes both spatial and nonspatial layers.
- Parameters:
- path_or_buffer: str, pathlib.Path, bytes, or file-like
A dataset path or URI, raw buffer, or file-like object with a read method.
- Returns:
- ndarray shape (2, n)
array of pairs of [<layer name>, <layer geometry type>] Note: geometry is None for nonspatial layers.
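Example (illustrative): listing the layers of a hypothetical multi-layer GeoPackage.
>>> from pyogrio import list_layers
>>> layers = list_layers("example.gpkg")  # hypothetical path
>>> layers                                # array of [<layer name>, <geometry type>] pairs; geometry type is None for nonspatial layers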
- pyogrio.read_bounds(path_or_buffer, /, layer=None, skip_features=0, max_features=None, where=None, bbox=None, mask=None)
Read bounds of each feature.
This can be used to assist with spatial indexing and partitioning, in order to avoid reading all features into memory. It is roughly 2-3x faster than reading the full geometry and attributes of a dataset.
- Parameters:
- path_or_buffer: str, pathlib.Path, bytes, or file-like
A dataset path or URI, raw buffer, or file-like object with a read method.
- layer: int or str, optional (default: first layer)
If an integer is provided, it corresponds to the index of the layer within the data source. If a string is provided, it must match the name of the layer in the data source. Defaults to first layer in data source.
- skip_features: int, optional (default: 0)
Number of features to skip from the beginning of the file before returning features. Must be less than the total number of features in the file.
- max_features: int, optional (default: None)
Number of features to read from the file. Must be less than the total number of features in the file minus skip_features (if used).
- where: str, optional (default: None)
Where clause to filter features in layer by attribute values. Uses a restricted form of SQL WHERE clause, defined here: http://ogdi.sourceforge.net/prop/6.2.CapabilitiesMetadata.html Examples: "ISO_A3 = 'CAN'", "POP_EST > 10000000 AND POP_EST < 100000000"
- bbox: tuple of (xmin, ymin, xmax, ymax), optional (default: None)
If present, will be used to filter records whose geometry intersects this box. This must be in the same CRS as the dataset. If GEOS is present and used by GDAL, only geometries that intersect this bbox will be returned; if GEOS is not available or not used by GDAL, all geometries with bounding boxes that intersect this bbox will be returned.
- mask: Shapely geometry, optional (default: None)
If present, will be used to filter records whose geometry intersects this geometry. This must be in the same CRS as the dataset. If GEOS is present and used by GDAL, only geometries that intersect this geometry will be returned; if GEOS is not available or not used by GDAL, all geometries with bounding boxes that intersect the bounding box of this geometry will be returned. Requires Shapely >= 2.0. Cannot be combined with the bbox keyword.
- Returns:
- tuple of (fids, bounds)
fids are global IDs read from the FID field of the dataset.
bounds are ndarray of shape (4, n) containing xmin, ymin, xmax, ymax.
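Example (illustrative): a sketch of using bounds to pre-filter features before a full read; the path and bbox values are hypothetical.
>>> from pyogrio import read_bounds
>>> fids, bounds = read_bounds("example.gpkg", bbox=(0.0, 50.0, 5.0, 55.0))
>>> bounds.shape  # (4, n): xmin, ymin, xmax, ymax for each selected feature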
- pyogrio.read_info(path_or_buffer, /, layer=None, encoding=None, force_feature_count=False, force_total_bounds=False, **kwargs)
Read information about an OGR data source.
crs, geometry and total_bounds will be None and features will be 0 for a nonspatial layer.
features will be -1 if this is an expensive operation for this driver. You can force it to be calculated using the force_feature_count parameter.
total_bounds is the 2-dimensional extent of all features within the dataset: (xmin, ymin, xmax, ymax). It will be None if this is an expensive operation for this driver or if the data source is nonspatial. You can force it to be calculated using the force_total_bounds parameter.
fid_column is the name of the FID field in the data source, if the FID is physically stored (e.g. in GPKG). If the FID is just a sequence, fid_column will be "" (e.g. ESRI Shapefile).
geometry_name is the name of the field where the main geometry is stored in the data source, if the field name can be customized (e.g. in GPKG). If no custom name is supported, geometry_name will be "" (e.g. ESRI Shapefile).
encoding will be UTF-8 if either the native encoding is likely to be UTF-8 or GDAL can automatically convert from the detected native encoding to UTF-8.
- Parameters:
- path_or_buffer: str, pathlib.Path, bytes, or file-like
A dataset path or URI, raw buffer, or file-like object with a read method.
- layer: int or str, optional
Name or index of layer in data source. Reads the first layer by default.
- encoding: str, optional (default: None)
If present, will be used as the encoding for reading string values from the data source, unless encoding can be inferred directly from the data source.
- force_feature_count: bool, optional (default: False)
True if the feature count should be computed even if it is expensive.
- force_total_bounds: bool, optional (default: False)
True if the total bounds should be computed even if it is expensive.
- **kwargs
Additional driver-specific dataset open options passed to OGR. Invalid options will trigger a warning.
- Returns:
- dict
A dictionary with the following keys:
{ "layer_name": "<layer name>", "crs": "<crs>", "fields": <ndarray of field names>, "dtypes": <ndarray of field dtypes>, "encoding": "<encoding>", "fid_column": "<fid column name or "">", "geometry_name": "<geometry column name or "">", "geometry_type": "<geometry type>", "features": <feature count or -1>, "total_bounds": <tuple with total bounds or None>, "driver": "<driver>", "capabilities": "<dict of driver capabilities>" "dataset_metadata": "<dict of dataset metadata or None>" "layer_metadata": "<dict of layer metadata or None>" }
- pyogrio.set_gdal_config_options(options)
Set GDAL configuration options.
Options are listed here: https://trac.osgeo.org/gdal/wiki/ConfigOptions
No error is raised if invalid option names are provided.
These options are applied for an entire session rather than for individual functions.
- Parameters:
- options: dict
If present, provides a mapping of option name / value pairs for GDAL configuration options. True / False are normalized to 'ON' / 'OFF'. A value of None for a config option can be used to clear out a previously set value.
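Example (illustrative): setting, reading back, and clearing a configuration option for the current session.
>>> from pyogrio import set_gdal_config_options, get_gdal_config_option
>>> set_gdal_config_options({"CPL_DEBUG": True})   # True is normalized to 'ON'
>>> get_gdal_config_option("CPL_DEBUG")
True
>>> set_gdal_config_options({"CPL_DEBUG": None})   # clear the previously set value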
- pyogrio.vsi_listtree(path: str | Path, pattern: str | None = None)
Recursively list the contents of a VSI directory.
An fnmatch pattern can be specified to filter the directories/files returned.
- Parameters:
- path: str or pathlib.Path
Path to the VSI directory to be listed.
- pattern: str, optional
Pattern to filter results, in fnmatch format.
- pyogrio.vsi_rmtree(path: str | Path)
Recursively remove VSI directory.
- Parameters:
- path: str or pathlib.Path
path to the VSI directory to be removed.
- pyogrio.vsi_unlink(path: str | Path)
Remove a VSI file.
- Parameters:
- path: str or pathlib.Path
Path to the vsimem file to be removed.
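Example (illustrative): a sketch of inspecting and cleaning up the in-memory (/vsimem/) filesystem; the directory name is hypothetical.
>>> from pyogrio import vsi_listtree, vsi_rmtree
>>> vsi_listtree("/vsimem/")                    # everything currently stored in memory
>>> vsi_listtree("/vsimem/", pattern="*.gpkg")  # only in-memory GeoPackage files
>>> vsi_rmtree("/vsimem/cache")                 # hypothetical directory; removes it and its contents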
GeoPandas integration
- pyogrio.read_dataframe(path_or_buffer, /, layer=None, encoding=None, columns=None, read_geometry=True, force_2d=False, skip_features=0, max_features=None, where=None, bbox=None, mask=None, fids=None, sql=None, sql_dialect=None, fid_as_index=False, use_arrow=None, on_invalid='raise', arrow_to_pandas_kwargs=None, **kwargs)
Read from an OGR data source to a GeoPandas GeoDataFrame or Pandas DataFrame.
If the data source does not have a geometry column or read_geometry is False, a DataFrame will be returned.
Requires geopandas >= 0.8.
- Parameters:
- path_or_buffer: pathlib.Path or str, or bytes buffer
A dataset path or URI, raw buffer, or file-like object with a read method.
- layer: int or str, optional (default: first layer)
If an integer is provided, it corresponds to the index of the layer within the data source. If a string is provided, it must match the name of the layer in the data source. Defaults to first layer in data source.
- encoding: str, optional (default: None)
If present, will be used as the encoding for reading string values from the data source. By default will automatically try to detect the native encoding and decode to UTF-8.
- columns: list-like, optional (default: all columns)
List of column names to import from the data source. Column names must exactly match the names in the data source, and will be returned in the order they occur in the data source. To avoid reading any columns, pass an empty list-like. If combined with the where parameter, must include columns referenced in the where expression or the data may not be correctly read; the data source may return empty results or raise an exception (behavior varies by driver).
- read_geometry: bool, optional (default: True)
If True, will read geometry into a GeoSeries. If False, a Pandas DataFrame will be returned instead.
- force_2d: bool, optional (default: False)
If the geometry has Z values, setting this to True will cause those to be ignored and 2D geometries to be returned.
- skip_features: int, optional (default: 0)
Number of features to skip from the beginning of the file before returning features. If greater than available number of features, an empty DataFrame will be returned. Using this parameter may incur significant overhead if the driver does not support the capability to randomly seek to a specific feature, because it will need to iterate over all prior features.
- max_features: int, optional (default: None)
Number of features to read from the file.
- where: str, optional (default: None)
Where clause to filter features in layer by attribute values. If the data source natively supports SQL, its specific SQL dialect should be used (e.g. SQLite and GeoPackage: SQLITE, PostgreSQL). If it doesn't, the OGRSQL WHERE syntax should be used. Note that it is not possible to overrule the SQL dialect; this is only possible when you use the sql parameter. Examples: "ISO_A3 = 'CAN'", "POP_EST > 10000000 AND POP_EST < 100000000"
- bbox: tuple of (xmin, ymin, xmax, ymax), optional (default: None)
If present, will be used to filter records whose geometry intersects this box. This must be in the same CRS as the dataset. If GEOS is present and used by GDAL, only geometries that intersect this bbox will be returned; if GEOS is not available or not used by GDAL, all geometries with bounding boxes that intersect this bbox will be returned. Cannot be combined with the mask keyword.
- mask: Shapely geometry, optional (default: None)
If present, will be used to filter records whose geometry intersects this geometry. This must be in the same CRS as the dataset. If GEOS is present and used by GDAL, only geometries that intersect this geometry will be returned; if GEOS is not available or not used by GDAL, all geometries with bounding boxes that intersect the bounding box of this geometry will be returned. Requires Shapely >= 2.0. Cannot be combined with the bbox keyword.
- fids: array-like, optional (default: None)
Array of integer feature id (FID) values to select. Cannot be combined with other keywords to select a subset (skip_features, max_features, where, bbox, mask, or sql). Note that the starting index is driver and file specific (e.g. typically 0 for Shapefile and 1 for GeoPackage, but can still depend on the specific file). The performance of reading a large number of features using FIDs is also driver specific and depends on the value of use_arrow. The order of the rows returned is undefined. If you would like to sort based on FID, use fid_as_index=True to have the index of the GeoDataFrame returned set to the FIDs of the features read. If use_arrow=True, the number of FIDs is limited to 4997 for drivers with 'OGRSQL' as default SQL dialect. To read a larger number of FIDs, set use_arrow=False.
- sql: str, optional (default: None)
The SQL statement to execute. Look at the sql_dialect parameter for more information on the syntax to use for the query. When combined with other keywords like columns, skip_features, max_features, where, bbox, or mask, those are applied after the SQL query. Be aware that this can have an impact on performance (e.g. filtering with the bbox or mask keywords may not use spatial indexes). Cannot be combined with the layer or fids keywords.
- sql_dialect: str, optional (default: None)
The SQL dialect the SQL statement is written in. Possible values:
None: if the data source natively supports SQL, its specific SQL dialect will be used by default (e.g. SQLite and GeoPackage: SQLITE, PostgreSQL). If the data source doesn't natively support SQL, the OGRSQL dialect is the default.
‘OGRSQL’: can be used on any data source. Performance can suffer when used on data sources with native support for SQL.
‘SQLITE’: can be used on any data source. All spatialite functions can be used. Performance can suffer on data sources with native support for SQL, except for Geopackage and SQLite as this is their native SQL dialect.
- fid_as_index: bool, optional (default: False)
If True, will use the FIDs of the features that were read as the index of the GeoDataFrame. May start at 0 or 1 depending on the driver.
- use_arrow: bool, optional (default: False)
Whether to use Arrow as the transfer mechanism of the read data from GDAL to Python (requires GDAL >= 3.6 and pyarrow to be installed). When enabled, this provides a further speed-up. Defaults to False, but this default can also be globally overridden by setting the PYOGRIO_USE_ARROW=1 environment variable.
- on_invalid: str, optional (default: "raise")
The action to take when an invalid geometry is encountered. Possible values:
raise: an exception will be raised if a WKB input geometry is invalid.
warn: invalid WKB geometries will be returned as None and a warning will be raised.
ignore: invalid WKB geometries will be returned as None without a warning.
- arrow_to_pandas_kwargs: dict, optional (default: None)
When use_arrow is True, these kwargs will be passed to the to_pandas call for the arrow to pandas conversion.
- **kwargs
Additional driver-specific dataset open options passed to OGR. Invalid options will trigger a warning.
- Returns:
- GeoDataFrame or DataFrame (if no geometry is present)
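Example (illustrative): filtered reads into a GeoDataFrame; the path, column names, and filter values are hypothetical, and the second call assumes pyarrow and GDAL >= 3.6 are available.
>>> from pyogrio import read_dataframe
>>> df = read_dataframe("example.gpkg", columns=["NAME", "POP_EST"], where="POP_EST > 10000000")
>>> df = read_dataframe("example.gpkg", bbox=(-10.0, 35.0, 30.0, 60.0), use_arrow=True)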
- pyogrio.write_dataframe(df, path, layer=None, driver=None, encoding=None, geometry_type=None, promote_to_multi=None, nan_as_null=True, append=False, use_arrow=None, dataset_metadata=None, layer_metadata=None, metadata=None, dataset_options=None, layer_options=None, **kwargs)
Write GeoPandas GeoDataFrame to an OGR file format.
- Parameters:
- df: GeoDataFrame or DataFrame
The data to write. For attribute columns of the “object” dtype, all values will be converted to strings to be written to the output file, except None and np.nan, which will be set to NULL in the output file.
- path: str or io.BytesIO
path to output file on writeable file system or an io.BytesIO object to allow writing to memory. Will raise NotImplementedError if an open file handle is passed; use BytesIO instead. NOTE: support for writing to memory is limited to specific drivers.
- layer: str, optional (default: None)
layer name to create. If writing to memory and a layer name is not provided, the layer name will be set to a UUID4 value.
- driver: string, optional (default: None)
The OGR format driver used to write the vector file. By default attempts to infer driver from path. Must be provided to write to memory.
- encoding: str, optional (default: None)
If present, will be used as the encoding for writing string values to the file. Use with caution, only certain drivers support encodings other than UTF-8.
- geometry_type: string, optional (default: None)
By default, the geometry type of the layer will be inferred from the data, after applying the promote_to_multi logic. If the data only contains a single geometry type (after applying the logic of promote_to_multi), this type is used for the layer. If the data (still) contains mixed geometry types, the output layer geometry type will be set to “Unknown”.
This parameter does not modify the geometry, but it will try to force the layer type of the output file to this value. Use this parameter with caution because using a non-default layer geometry type may result in errors when writing the file, may be ignored by the driver, or may result in invalid files. Possible values are: “Unknown”, “Point”, “LineString”, “Polygon”, “MultiPoint”, “MultiLineString”, “MultiPolygon” or “GeometryCollection”.
- promote_to_multi: bool, optional (default: None)
If True, will convert singular geometry types in the data to their corresponding multi geometry type for writing. By default, will convert mixed singular and multi geometry types to multi geometry types for drivers that do not support mixed singular and multi geometry types. If False, geometry types will not be promoted, which may result in errors or invalid files when attempting to write mixed singular and multi geometry types to drivers that do not support such combinations.
- nan_as_null: bool, default True
For floating point columns (float32 / float64), whether NaN values are written as “null” (missing value). Defaults to True because in pandas NaNs are typically used as missing value. Note that when set to False, behaviour is format specific: some formats don’t support NaNs by default (e.g. GeoJSON will skip this property) or might treat them as null anyway (e.g. GeoPackage).
- append: bool, optional (default: False)
If True, and the data source specified by path already exists and the driver supports appending to an existing data source, the data will be appended to the existing records in the data source. Not supported for writing to in-memory files. NOTE: append support is limited to specific drivers and GDAL versions.
- use_arrow: bool, optional (default: False)
Whether to use Arrow as the transfer mechanism of the data to write from Python to GDAL (requires GDAL >= 3.8 and pyarrow to be installed). When enabled, this provides a further speed-up. Defaults to False, but this default can also be globally overridden by setting the PYOGRIO_USE_ARROW=1 environment variable. Using Arrow does not support writing an object-dtype column with mixed types.
- dataset_metadata: dict, optional (default: None)
Metadata to be stored at the dataset level in the output file; limited to drivers that support writing metadata, such as GPKG, and silently ignored otherwise. Keys and values must be strings.
- layer_metadata: dict, optional (default: None)
Metadata to be stored at the layer level in the output file; limited to drivers that support writing metadata, such as GPKG, and silently ignored otherwise. Keys and values must be strings.
- metadata: dict, optional (default: None)
alias of layer_metadata
- dataset_options: dict, optional
Dataset creation options (format specific) passed to OGR. Specify as a key-value dictionary.
- layer_options: dict, optional
Layer creation options (format specific) passed to OGR. Specify as a key-value dictionary.
- **kwargs
Additional driver-specific dataset or layer creation options passed to OGR. pyogrio will attempt to automatically pass those keywords either as dataset or as layer creation option based on the known options for the specific driver. Alternatively, you can use the explicit dataset_options or layer_options keywords to manually do this (for example if an option exists as both dataset and layer option).
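Example (illustrative): a minimal sketch of writing a small GeoDataFrame to a GeoPackage; the output path is hypothetical and the driver is inferred from the ".gpkg" extension.
>>> import geopandas
>>> from shapely.geometry import Point
>>> from pyogrio import write_dataframe
>>> gdf = geopandas.GeoDataFrame({"name": ["a", "b"]}, geometry=[Point(0, 0), Point(1, 1)], crs="EPSG:4326")
>>> write_dataframe(gdf, "points.gpkg", layer="points")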
Arrow integration
- pyogrio.read_arrow(path_or_buffer, /, layer=None, encoding=None, columns=None, read_geometry=True, force_2d=False, skip_features=0, max_features=None, where=None, bbox=None, mask=None, fids=None, sql=None, sql_dialect=None, return_fids=False, **kwargs)
Read OGR data source into a pyarrow Table.
See docstring of read for parameters.
- Returns:
- (dict, pyarrow.Table)
Returns a tuple of meta information about the data source in a dict, and a pyarrow Table with data.
- Meta is:
{
    "crs": "<crs>",
    "fields": <ndarray of field names>,
    "encoding": "<encoding>",
    "geometry_type": "<geometry_type>",
    "geometry_name": "<name of geometry column in arrow table>",
}
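Example (illustrative): reading a hypothetical data source into a pyarrow Table; assumes a GDAL build with Arrow support (GDAL >= 3.6) and pyarrow installed.
>>> from pyogrio import read_arrow
>>> meta, table = read_arrow("example.gpkg")  # hypothetical path
>>> meta["geometry_name"], table.num_rows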
- pyogrio.open_arrow(path_or_buffer, /, layer=None, encoding=None, columns=None, read_geometry=True, force_2d=False, skip_features=0, max_features=None, where=None, bbox=None, mask=None, fids=None, sql=None, sql_dialect=None, return_fids=False, batch_size=65536, use_pyarrow=False, **kwargs)
Open OGR data source as a stream of Arrow record batches.
See docstring of read for parameters.
The returned object is reading from a stream provided by OGR and must not be accessed after the OGR dataset has been closed, i.e. after the context manager has been closed.
By default this function returns a generic stream object implementing the Arrow PyCapsule Protocol (i.e. having an __arrow_c_stream__ method). This object can then be consumed by your Arrow implementation of choice that supports this protocol. Optionally, you can specify use_pyarrow=True to directly get the stream as a pyarrow.RecordBatchReader.
- Returns:
- (dict, pyarrow.RecordBatchReader or ArrowStream)
Returns a tuple of meta information about the data source in a dict, and a data stream object (a generic ArrowStream object, or a pyarrow RecordBatchReader if use_pyarrow is set to True).
- Meta is:
{
    "crs": "<crs>",
    "fields": <ndarray of field names>,
    "encoding": "<encoding>",
    "geometry_type": "<geometry_type>",
    "geometry_name": "<name of geometry column in arrow table>",
}
- Other Parameters:
- batch_size: int (default: 65_536)
Maximum number of features to retrieve in a batch.
- use_pyarrow: bool (default: False)
If True, return a pyarrow RecordBatchReader instead of a generic ArrowStream object. In the default case, this stream object needs to be passed to another library supporting the Arrow PyCapsule Protocol to consume the stream of data.
Examples
>>> from pyogrio.raw import open_arrow
>>> import pyarrow as pa
>>> import shapely
>>>
>>> with open_arrow(path) as source:
>>>     meta, stream = source
>>>     # wrap the arrow stream object in a pyarrow RecordBatchReader
>>>     reader = pa.RecordBatchReader.from_stream(stream)
>>>     geom_col = meta["geometry_name"] or "wkb_geometry"
>>>     for batch in reader:
>>>         geometries = shapely.from_wkb(batch[geom_col])
The returned stream object needs to be consumed by a library implementing the Arrow PyCapsule Protocol. In the above example, pyarrow is used through its RecordBatchReader. For this case, you can also specify use_pyarrow=True to directly get this result as a short-cut:
>>> with open_arrow(path, use_pyarrow=True) as source:
>>>     meta, reader = source
>>>     geom_col = meta["geometry_name"] or "wkb_geometry"
>>>     for batch in reader:
>>>         geometries = shapely.from_wkb(batch[geom_col])
- pyogrio.write_arrow(arrow_obj, path, layer=None, driver=None, geometry_name=None, geometry_type=None, crs=None, encoding=None, append=False, dataset_metadata=None, layer_metadata=None, metadata=None, dataset_options=None, layer_options=None, **kwargs)
Write an Arrow-compatible data source to an OGR file format.
- Parameters:
- arrow_obj
The Arrow data to write. This can be any Arrow-compatible tabular data object that implements the Arrow PyCapsule Protocol (i.e. has an __arrow_c_stream__ method), for example a pyarrow Table or RecordBatchReader.
- path: str or io.BytesIO
path to output file on writeable file system or an io.BytesIO object to allow writing to memory. NOTE: support for writing to memory is limited to specific drivers.
- layer: str, optional (default: None)
layer name to create. If writing to memory and a layer name is not provided, the layer name will be set to a UUID4 value.
- driver: string, optional (default: None)
The OGR format driver used to write the vector file. By default attempts to infer driver from path. Must be provided to write to memory.
- geometry_name: str, optional (default: None)
The name of the column in the input data that will be written as the geometry field. Will be inferred from the input data if the geometry column is annotated as a "geoarrow.wkb" or "ogc.wkb" extension type. Otherwise needs to be specified explicitly.
- geometry_type: str
The geometry type of the written layer. Currently, this needs to be specified explicitly when creating a new layer with geometries. Possible values are: “Unknown”, “Point”, “LineString”, “Polygon”, “MultiPoint”, “MultiLineString”, “MultiPolygon” or “GeometryCollection”.
This parameter does not modify the geometry, but it will try to force the layer type of the output file to this value. Use this parameter with caution because using a wrong layer geometry type may result in errors when writing the file, may be ignored by the driver, or may result in invalid files.
- crs: str, optional (default: None)
WKT-encoded CRS of the geometries to be written.
- encoding: str, optional (default: None)
Only used for the .dbf file of ESRI Shapefiles. If not specified, uses the default locale.
- append: bool, optional (default: False)
If True, and the data source specified by path already exists and the driver supports appending to an existing data source, the data will be appended to the existing records in the data source. Not supported for writing to in-memory files. NOTE: append support is limited to specific drivers and GDAL versions.
- dataset_metadata: dict, optional (default: None)
Metadata to be stored at the dataset level in the output file; limited to drivers that support writing metadata, such as GPKG, and silently ignored otherwise. Keys and values must be strings.
- layer_metadata: dict, optional (default: None)
Metadata to be stored at the layer level in the output file; limited to drivers that support writing metadata, such as GPKG, and silently ignored otherwise. Keys and values must be strings.
- metadata: dict, optional (default: None)
alias of layer_metadata
- dataset_options: dict, optional
Dataset creation options (format specific) passed to OGR. Specify as a key-value dictionary.
- layer_options: dict, optional
Layer creation options (format specific) passed to OGR. Specify as a key-value dictionary.
- **kwargs
Additional driver-specific dataset or layer creation options passed to OGR. pyogrio will attempt to automatically pass those keywords either as dataset or as layer creation option based on the known options for the specific driver. Alternatively, you can use the explicit dataset_options or layer_options keywords to manually do this (for example if an option exists as both dataset and layer option).
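Example (illustrative): a sketch of writing a pyarrow Table with a plain WKB geometry column; it assumes shapely is available to produce WKB and pyproj to produce a WKT CRS string, and the output path is hypothetical.
>>> import pyarrow as pa
>>> import shapely
>>> from pyproj import CRS
>>> from pyogrio import write_arrow
>>> wkb = [shapely.to_wkb(shapely.Point(0, 0)), shapely.to_wkb(shapely.Point(1, 1))]
>>> table = pa.table({"name": ["a", "b"], "geometry": wkb})  # geometry stored as raw WKB bytes
>>> write_arrow(table, "points.gpkg", geometry_name="geometry", geometry_type="Point", crs=CRS("EPSG:4326").to_wkt())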