State of STAC

By: Tim Schaub on August 31, 2022

STAC-ing up! We gathered statistics on over 550 million assets. Check out the summary of our project and register to be included in our next crawl.

The number of imagery providers hosting data online has increased over the years. Search interfaces and metadata formats have also grown to describe and provide access to that data. The SpatioTemporal Asset Catalog (STAC) specification aims to provide a unified framework for describing and linking to earth observation data, with the goal of increasing the interoperability of search tools and making it easier to access the data.

Since the release of version 1.0.0 in May of 2021, the specification has been gaining popularity, with massive archives of publicly accessible geospatial data being hosted with STAC-compliant metadata. The USGS hosts STAC metadata for the entire Landsat Collection 2 archive. Microsoft’s Planetary Computer provides a STAC API with access to imagery archives of Landsat, MODIS, Sentinel, and more.

Planet sees STAC as a key component in making earth observation data more accessible to developers, and helping advance it is a core part of our “Open Initiatives” as previously described. STAC’s adoption has felt very rapid, but we wanted more concrete metrics to establish a baseline for its actual adoption. Building up knowledge of how data is exposed as STAC can help us understand how the standard is being used and how mature various extensions are. This, in turn, can help us prioritize what we work on to increase its uptake. So to better understand how STAC is being used, we recently deployed a crawler to visit publicly accessible STAC endpoints, recording millions of statistics about the catalogs, collections, items, and assets linked from those endpoints.

STAC in a Nutshell

The STAC specification describes a few different resource types: catalogs, collections, and items. Items represent the individual spatio-temporal data entries and have references to the data assets. Collections are used to group similar items. Catalogs are the top-level entry and can also be used as sub-groupings of collections or items.
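To make the resource types concrete, here is a minimal sketch of what a STAC item looks like as a JSON-style document, written as a Python dictionary. The field names follow the STAC 1.0.0 item layout, but the id, coordinates, and URLs are made up for illustration, and the `resource_type` helper is a hypothetical convenience that relies on the `type` field used by 1.0.0 documents.

```python
# A hypothetical, minimal STAC item. The id, coordinates, and hrefs are made up.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-scene",
    "collection": "example-collection",
    "geometry": {"type": "Point", "coordinates": [-122.4, 37.8]},
    "bbox": [-122.4, 37.8, -122.4, 37.8],
    "properties": {"datetime": "2022-08-31T00:00:00Z"},
    "links": [{"rel": "collection", "href": "./collection.json"}],
    "assets": {
        "visual": {
            "href": "https://example.com/example-scene.tif",
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        }
    },
}


def resource_type(doc):
    """Map the 'type' field of a STAC 1.0.0 document to a resource name.

    Returns None when the field is missing or unrecognized, which can
    happen with documents written against older versions of the spec.
    """
    names = {"Feature": "item", "Collection": "collection", "Catalog": "catalog"}
    return names.get(doc.get("type"))


print(resource_type(item), "with", len(item["assets"]), "asset(s)")
```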

STAC implementations come with two different flavors of interface: static files and API. The static flavor can be thought of as an online folder or directory of JSON documents linking to one another. The API flavor is designed to provide more efficient pagination through large collections of items and may provide more advanced search functionality.
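The "folder of linked JSON documents" idea can be sketched as a small breadth-first walk. This is not the crawler used for the project, only an illustration under simplifying assumptions: the documents live in an in-memory dictionary instead of being fetched over HTTP, link hrefs are used directly as keys rather than being resolved as relative URLs, and only `child` and `item` links are followed. A crawler for the API flavor would also need to page through item listings.

```python
from collections import Counter, deque

# Hypothetical static catalog: each key stands in for a URL and each value
# for the JSON document that would be found there.
DOCUMENTS = {
    "catalog.json": {
        "type": "Catalog",
        "links": [{"rel": "child", "href": "collection.json"}],
    },
    "collection.json": {
        "type": "Collection",
        "links": [
            {"rel": "parent", "href": "catalog.json"},
            {"rel": "item", "href": "item-1.json"},
            {"rel": "item", "href": "item-2.json"},
        ],
    },
    "item-1.json": {
        "type": "Feature",
        "links": [],
        "assets": {"data": {"href": "a.tif"}},
    },
    "item-2.json": {
        "type": "Feature",
        "links": [],
        "assets": {"data": {"href": "b.tif"}, "thumbnail": {"href": "b.png"}},
    },
}


def crawl(root, fetch):
    """Walk child and item links from root, counting resources and item assets."""
    counts = Counter()
    seen = set()
    queue = deque([root])
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        doc = fetch(url)
        counts[doc["type"]] += 1
        if doc["type"] == "Feature":
            counts["assets"] += len(doc.get("assets", {}))
        for link in doc.get("links", []):
            # Links such as "parent" or "root" point back up and are skipped.
            if link["rel"] in ("child", "item"):
                queue.append(link["href"])
    return counts


counts = crawl("catalog.json", DOCUMENTS.__getitem__)
print(dict(counts))
```

Passing `fetch` as a parameter keeps the traversal separate from how documents are retrieved, so the same loop could be pointed at an HTTP client instead of the dictionary.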

Crawl Summary

Starting with just 27 endpoints, sourced from the amazing stacindex.org, our crawl visited over 120 million items, spread across almost four thousand collections. We gathered statistics on over 550 million assets referenced by those items. About 97% of the collections were from static catalogs, while over 99% of the items and assets came from STAC API implementations.

[Figure: bar chart of asset counts per catalog on a log scale from 10k to 100M. Catalogs shown: planetarycomputer.microsoft.com/api/stac/v1, landsatlook.usgs.gov/stac-server, eod-catalog-svc-prod.astraea.earth, stac.amskepler.com/v100, meeo-s5p.s3.amazonaws.com/catalog.json, globalnightlight.s3.amazonaws.com/VIIRS_npp_catalog.json, nasa-iserv.s3-us-west-2.amazonaws.com/catalog/catalog.json, earth-search.aws.element84.com/v0, pta.data.lit.fmi.fi/stac/root.json, franklin.nasa-hsi.azavea.com]
Top ten catalogs ordered by number of assets (log scale).
Resource          Static          API        Total
Catalogs          10,084       26,041       36,125
Collections        3,736          131        3,867
Items          1,077,616  122,258,062  123,335,678
Assets         1,497,447  550,133,772  551,631,219
Counts of catalogs, collections, items, and assets crawled, organized by implementation type.

The numbers suggest that people use collections far more frequently in static catalogs, presumably providing a way for users to explore reasonably small batches of items. The STAC API implementations, by contrast, are dominated by catalogs with relatively small numbers of collections, and those API collections include many more items than their static counterparts. Because the STAC API provides search capabilities, catalogs and collections don’t need to be used as navigational aids in finding relevant items.
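The percentages quoted in the crawl summary, and the difference in collection size between the two flavors, can be reproduced from the counts in the table. The snippet below is only a worked calculation on those published numbers. The helper names are made up for this example.

```python
# Counts copied from the crawl summary table.
COUNTS = {
    "catalogs": {"static": 10_084, "api": 26_041},
    "collections": {"static": 3_736, "api": 131},
    "items": {"static": 1_077_616, "api": 122_258_062},
    "assets": {"static": 1_497_447, "api": 550_133_772},
}


def share(resource, flavor):
    """Fraction of a resource type that came from the given flavor."""
    row = COUNTS[resource]
    return row[flavor] / (row["static"] + row["api"])


def items_per_collection(flavor):
    """Average number of items per collection for the given flavor."""
    return COUNTS["items"][flavor] / COUNTS["collections"][flavor]


for flavor in ("static", "api"):
    print(
        f"{flavor}: {share('collections', flavor):.1%} of collections, "
        f"{share('items', flavor):.1%} of items, "
        f"{items_per_collection(flavor):,.0f} items per collection on average"
    )
```

On these numbers, the average static collection holds a few hundred items, while the average API collection holds several hundred thousand.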

STAC Versions

The core STAC specification reached a stable 1.0 version in May of 2021, and the STAC API specification had its first release candidate toward 1.0 in March of 2022. Of the catalog, collection, and item resources we crawled, 88% conform to version 1.0.0. About 12% are still using version 1.0.0-beta.2, and a few stragglers are still on versions 1.0.0-rc.2, 1.0.0-rc.3, and 0.8.1.

Version       Catalogs  Collections        Items
1.0.0           35,984        3,258  108,967,770
1.0.0-rc.3           0            0       10,000
1.0.0-rc.2           2            0            0
1.0.0-beta.2       123          597   14,352,545
0.8.1               16           12        5,345
Counts of catalogs, collections, and items organized by STAC version. These counts don’t distinguish between STAC API versions and core STAC versions.
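As a quick check on the version shares, the per-version totals can be summed across catalogs, collections, and items. This is a worked calculation on the table's numbers and nothing more. The variable names are chosen for the example.

```python
# (catalogs, collections, items) per STAC version, copied from the table.
COUNTS_BY_VERSION = {
    "1.0.0": (35_984, 3_258, 108_967_770),
    "1.0.0-rc.3": (0, 0, 10_000),
    "1.0.0-rc.2": (2, 0, 0),
    "1.0.0-beta.2": (123, 597, 14_352_545),
    "0.8.1": (16, 12, 5_345),
}

totals = {version: sum(row) for version, row in COUNTS_BY_VERSION.items()}
grand_total = sum(totals.values())
shares = {version: count / grand_total for version, count in totals.items()}

# Print versions from most to least common.
for version, fraction in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{version}: {fraction:.3%}")
```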

It is encouraging to see the number of implementations that comply with the stable 1.0.0 version. Perhaps those still using an earlier version have a path to upgrade.

STAC Extensions

The core STAC specification limits itself to describing the spatial and temporal aspects of the data—specifying metadata fields for the geographic location and time range associated with the data. STAC extensions provide a way to add information specific to a given domain. Extensions exist for adding metadata specific to electro-optical data, for adding additional coordinate reference system information, for adding detail about raster bands, and more.

Of the 120 million STAC items crawled, 86% referenced at least one extension to the STAC core. In total, 46 different extensions were used by the catalogs, collections, and items visited in the crawl.
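A tally like this can be built from the `stac_extensions` list that STAC resources use to declare the extensions they implement. The sketch below counts extension identifiers over a handful of made-up items. The sample data and the `extension_usage` helper are assumptions for illustration, not the crawler's actual bookkeeping.

```python
from collections import Counter

# Example extension schema identifiers, used here only as sample data.
EO = "https://stac-extensions.github.io/eo/v1.0.0/schema.json"
PROJ = "https://stac-extensions.github.io/projection/v1.0.0/schema.json"

# Hypothetical items reduced to the one field this tally needs.
items = [
    {"id": "a", "stac_extensions": [EO, PROJ]},
    {"id": "b", "stac_extensions": [EO]},
    {"id": "c"},  # no extensions declared
]


def extension_usage(items):
    """Return per-extension item counts and the share of items using any extension."""
    usage = Counter()
    with_any = 0
    for item in items:
        # A set guards against an extension listed twice on the same item.
        extensions = set(item.get("stac_extensions", []))
        if extensions:
            with_any += 1
        usage.update(extensions)
    share = with_any / len(items) if items else 0.0
    return usage, share


usage, share_with_extension = extension_usage(items)
print(f"{share_with_extension:.0%} of items reference at least one extension")
print(f"{len(usage)} distinct extensions seen")
```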

The data collected on extension usage should provide real metrics for assigning maturity classifications to proposed extensions, suggesting which might be candidates to graduate to a stable classification and which might be candidates for deprecation due to lack of use.

STAC Asset Types

Assets in STAC provide access to the underlying spatio-temporal data. For STAC items representing satellite imagery, the assets could be GeoTIFF rasters. Items may have assets that refer to additional JSON or HTML documents. The STAC asset metadata includes a “type” property advertising the asset media type. Values like “image/tiff” represent a generic TIFF image; “image/tiff; application=geotiff; profile=cloud-optimized” represents the Cloud Optimized GeoTIFF variety.

Asset Type                                                      Count
application/json                                          101,831,216
application/xml                                            87,647,164
image/tiff; application=geotiff; profile=cloud-optimized   75,945,405
image/png                                                  72,572,728
image/jpeg                                                 50,259,307
text/plain                                                 41,762,282
text/html                                                  41,139,833
image/vnd.stac.geotiff; cloud-optimized=true               39,807,777
application/x-hdf                                          16,489,141
application/netcdf                                         10,252,878
image/jp2                                                   3,639,899
application/gml+xml                                         3,619,721
application/vnd.laszip+copc                                 3,313,116
image/tiff; application=geotiff                               946,859
application/x.mrf                                             902,162
text/xml                                                      620,780
image/tiff                                                    402,485
application/wmo-GRIB2                                         143,509
application/x-ndjson                                          143,509
application/x-netcdf                                          127,042
(missing)                                                      25,051
application/html                                               10,000
application/vnd.google-earth.kml+xml                            9,249
application/gzip                                                9,249
application/geopackage+sqlite3                                  5,854
application/geo+json                                            2,846
application/vnd+zarr                                            1,119
application/octet-stream                                          569
application/x-parquet                                             411
application/zip                                                    58
Asset type counts.

At first glance, it is surprising to see the high counts of asset types that refer to additional metadata – the “application/json” and “application/xml” asset types, for example. We had anticipated that STAC assets would refer primarily to the spatiotemporal data instead of additional metadata describing that data. As it turns out, many implementations use assets to point to additional metadata that either doesn’t fit into STAC or provides an alternate representation of the same metadata found in the STAC resources. In total, just under 50% of the 550 million assets could be classified as metadata assets as opposed to actual data assets.

Considering all of its varieties together, TIFF is the most common asset type, representing over 20% of all assets crawled. About 99% of the TIFF assets are advertised as GeoTIFFs. It is possible the remaining TIFFs are GeoTIFFs as well, but simply use “image/tiff” as their asset type. Of the GeoTIFFs, over 99% are of the Cloud Optimized variety. It appears that people are still not sure about what media type is appropriate for Cloud Optimized GeoTIFF (COG). Although there is not yet an officially specified MIME type for COGs, there is growing consensus around “image/tiff; application=geotiff; profile=cloud-optimized” as a candidate for specification.
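The TIFF figures above can be recomputed from the asset type table. The grouping rules in this sketch are an assumption: a media type is treated as a GeoTIFF if it contains "geotiff" and as cloud optimized if it contains "cloud-optimized", which is one plausible reading of the four TIFF varieties and not necessarily how the original analysis classified them.

```python
# Counts for the TIFF varieties, copied from the asset type table.
TIFF_COUNTS = {
    "image/tiff; application=geotiff; profile=cloud-optimized": 75_945_405,
    "image/vnd.stac.geotiff; cloud-optimized=true": 39_807_777,
    "image/tiff; application=geotiff": 946_859,
    "image/tiff": 402_485,
}

# Total assets crawled, from the crawl summary table.
TOTAL_ASSETS = 551_631_219


def is_geotiff(media_type):
    """Assumed rule: any media type mentioning 'geotiff' is a GeoTIFF."""
    return "geotiff" in media_type


def is_cog(media_type):
    """Assumed rule: any media type mentioning 'cloud-optimized' is a COG."""
    return "cloud-optimized" in media_type


tiff = sum(TIFF_COUNTS.values())
geotiff = sum(n for t, n in TIFF_COUNTS.items() if is_geotiff(t))
cog = sum(n for t, n in TIFF_COUNTS.items() if is_cog(t))

print(f"TIFF share of all assets: {tiff / TOTAL_ASSETS:.1%}")
print(f"GeoTIFF share of TIFFs:   {geotiff / tiff:.1%}")
print(f"COG share of GeoTIFFs:    {cog / geotiff:.1%}")
```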

Under the Hood

The service used to perform the crawl made heavy use of go-stac, which has been expanded with a full set of “crawler” utilities.

The go-stac project on GitHub.com

The crawling done by go-stac is now quite robust against errors and failures. It can work through multiple endpoints, using a task queue to process large amounts of data concurrently. The validate and stats commands in the go-stac command-line interface use the same crawling backend, which makes it a good fit for large STAC catalogs.

The crawl results were stored in a BigQuery table, and we’ll work to make the resulting data available to the community. It will be great to see what other insights might come from this data.

What’s Next

We hope that reporting on real world implementations helps the community understand how the STAC specification is being used. We are looking forward to running another crawl and updating the results as more implementations are rolled out and the STAC spec continues to evolve.

While the linked nature of STAC makes it possible to crawl and gather statistics on publicly accessible resources, it is not the most efficient way to summarize this information – particularly for very deep collections and relatively small page sizes. We have proposed a lightweight stats extension for reporting counts of resource types, versions, extensions, and asset types accessible in catalogs and collections. In future crawls, when the crawler encounters stats like those described in the extension, it will stop crawling deeper and increment counts based on what is advertised in the stats metadata. The stats extension is not yet finalized, so try it out with your catalogs and API implementations, and share any feedback or contributions in our issue tracker.
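The short-circuit behavior described above might look something like the following sketch. The `stats:items` field name and its `count` member are hypothetical placeholders, not taken from the proposed extension, and the in-memory documents stand in for real HTTP responses. The point is only to show how advertised counts let a crawler skip an entire subtree.

```python
# Hypothetical field name; the real extension may use different names.
STATS_FIELD = "stats:items"

# Made-up documents keyed by stand-in URLs.
DOCUMENTS = {
    "catalog.json": {
        "type": "Catalog",
        "links": [
            {"rel": "child", "href": "small.json"},
            {"rel": "child", "href": "huge.json"},
        ],
    },
    "small.json": {
        "type": "Collection",
        "links": [{"rel": "item", "href": "item-1.json"}],
    },
    "item-1.json": {"type": "Feature", "links": []},
    # This collection advertises its item count, so its items are never fetched.
    "huge.json": {
        "type": "Collection",
        STATS_FIELD: {"count": 1_000_000},
        "links": [{"rel": "item", "href": "not-fetched.json"}],
    },
}

fetched = []


def fetch(url):
    """Look up a document and record that it was requested."""
    fetched.append(url)
    return DOCUMENTS[url]


def count_items(url):
    """Count items under url, trusting advertised stats when present."""
    doc = fetch(url)
    stats = doc.get(STATS_FIELD)
    if stats is not None:
        # Stop descending and use the advertised count instead.
        return stats["count"]
    total = 1 if doc.get("type") == "Feature" else 0
    for link in doc.get("links", []):
        if link["rel"] in ("child", "item"):
            total += count_items(link["href"])
    return total


total_items = count_items("catalog.json")
print(f"{total_items:,} items counted with {len(fetched)} requests")
```

The trade-off is that the crawler has to trust whatever the catalog advertises, so the resulting counts are only as accurate as the published stats.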

And please upgrade to STAC 1.0.0 and STAC API 1.0.0-rc.1, and register on stacindex.org to be included in our next crawl!