By: Tim Schaub on August 31 2022
STAC-ing up! We gathered statistics on over 550 million assets. Check out the summary of our project and register to be included in our next crawl.
The number of imagery providers hosting data online has increased over the years. Search interfaces and metadata formats have also grown to describe and provide access to that data. The SpatioTemporal Asset Catalog (STAC) specification aims to provide a unified framework for describing and linking to earth observation data, with the goal of increasing the interoperability of search tools and making it easier to access the data.
Since its release in May of 2021, the specification has been gaining popularity, with massive archives of publicly accessible geospatial data being hosted with STAC compliant metadata. The USGS hosts STAC metadata for the entire Landsat Collection 2 archive. Microsoft’s Planetary Computer provides a STAC API with access to imagery archives of Landsat, MODIS, Sentinel, and more.
Planet sees STAC as a key component in making earth observation data more accessible to developers, and helping advance it is a core part of our “Open Initiatives” as previously described. STAC’s adoption has felt very rapid, but we wanted more concrete metrics to establish a baseline for its actual adoption. Building up knowledge of how data is exposed as STAC can help us understand how the standard is being used and how mature various extensions are. This, in turn, can help us prioritize what we work on to increase its uptake. So to better understand how STAC is being used, we recently deployed a crawler to visit publicly accessible STAC endpoints, recording millions of statistics about the catalogs, collections, items, and assets linked from those endpoints.
STAC in a Nutshell¶
The STAC specification describes a few different resource types: catalogs, collections, and items. Items represent the individual spatio-temporal data entries and have references to the data assets. Collections are used to group similar items. Catalogs are the top-level entry and can also be used as sub-groupings of collections or items.
STAC implementations come with two different flavors of interface: static files and API. The static flavor can be thought of as an online folder or directory of JSON documents linking to one another. The API flavor is designed to provide more efficient pagination through large collections of items and may provide more advanced search functionality.
Starting with just 27 endpoints, sourced from the amazing stacindex.org, our crawl visited over 120 million items, spread across almost four thousand collections. We gathered statistics on over 550 million assets referenced by those items. About 97% of the collections were from static catalogs, while over 99% of the items and assets came from STAC API implementations.
The numbers suggest that people use collections far more frequently in static catalogs, presumably providing a way for users to explore reasonably small batches of items. While the STAC API implementations are dominated by catalogs with relatively small numbers of collections. Those API collections include many more items compared with their static counterparts. Because the STAC API provides search capabilities, catalogs and collections don’t need to be used as navigational aids in finding relevant items.
The core STAC specification reached a stable 1.0 version in May of 2021, and the STAC API specification had its first release candidate toward 1.0 in March of 2022. Of the catalog, collection, and item resources we crawled, 88% conform with version 1.0.0. About 12% are still using version 1.0.0-beta.2. And a few stragglers are still on versions 1.0.0-rc.2, 1.0.0-rc.3, and 0.8.1.
It is encouraging to see the number of implementations that comply with the stable 1.0.0 version. Perhaps those still using an earlier version have a path to upgrade.
The core STAC specification limits itself to describing the spatial and temporal aspects of the data—specifying metadata fields for the geographic location and time range associated with the data. STAC extensions provide a way to add information specific to a given domain. Extensions exist for adding metadata specific to electro-optical data, for adding additional coordinate reference system information, for adding detail about raster bands, and more.
Of the 120 million STAC items crawled, 86% referenced at least one extension to the STAC core. In total, 46 different extensions were used by the catalogs, collections, and items visited in the crawl.
This information collected on extension usage should be useful to help give real metrics to establish the proper maturity classification of proposed extensions—suggesting which might be candidates to graduate to a stable classification or which might be candidates for deprecation due to lack of use.
STAC Asset Types¶
Assets in STAC provide access to the underlying spatio-temporal data. For STAC items representing satellite imagery, the assets could be GeoTIFF rasters. Items may have assets that refer to additional JSON or HTML documents. The STAC asset metadata includes a “type” property advertising the asset media type. Values like “image/tiff” represent a generic TIFF image; “image/tiff; application=geotiff; profile=cloud-optimized” represents the Cloud Optimized GeoTIFF variety.
|image/tiff; application=geotiff; profile=cloud-optimized||75,945,405|
At first glance, it is surprising to see the high counts of asset types that refer to additional metadata – the “application/json” and “application/xml” asset types, for example. We had anticipated that STAC assets would refer primarily to the spatiotemporal data instead of additional metadata describing that data. As it turns out, many implementations use assets to point to additional metadata that either doesn’t fit into STAC or provides an alternate representation of the same metadata found in the STAC resources. In total, just under 50% of the 550 million assets could be classified as metadata assets as opposed to actual data assets.
Considering all of its varieties together, TIFF is the most common asset type, representing over 20% of all assets crawled. About 99% of the TIFF assets are advertised as GeoTIFFs. It is possible the remaining TIFFs are GeoTIFFs as well, but simply use “image/tiff” as their asset type. Of the GeoTIFFs, over 99% are of the Cloud Optimized variety. It appears that people are still not sure about what media type is appropriate for Cloud Optimized GeoTIFF (COG). Although there is not yet an officially specified MIME type for COGs, there is growing consensus around “image/tiff; application=geotiff; profile=cloud-optimized” as a candidate for specification.
Under the Hood¶
The service used to perform the crawl made heavy use of go-stac, which has been expanded with a full set of “crawler” utilities.
The crawling done by go-stac is now quite robust against errors and failures, and multiple end-points can be used, with a task queue to go through large amounts of data concurrently. The
stats commands in the go-stac command-line interface use the same crawling backend, so it’s now a great tool to use with large STAC catalogs.
The crawl results were stored in a BigQuery table, and we’ll work to make the resulting data available to the community. It will be great to see what other insights might come from this data.
We hope that reporting on real world implementations helps the community understand how the STAC specification is being used. We are looking forward to running another crawl and updating the results as more implementations are rolled out and the STAC spec continues to evolve.
While the linked nature of STAC makes it possible to crawl and gather statistics on publicly accessible resources, it is not the most efficient way to summarize this information – particularly for very deep collections and relatively small page sizes. We have proposed a lightweight stats extension for reporting counts of resource types, versions, extensions, and asset types accessible in catalogs and collections. In future crawls, when the crawler encounters stats like those described in the extension, it will stop crawling deeper and increment counts based on what is advertised in the stats metadata. The stats extension is not yet finalized, so try it out with your catalogs and API implementations, and share any feedback or contributions in our issue tracker.
And please upgrade to STAC 1.0.0 and STAC API 1.0.0.rc.1 and register on stacindex.org to be included in our next crawl!