HiRISE Image Patches Obscured by Atmospheric Dust
· 2019
· Open Access
· DOI: https://doi.org/10.5281/zenodo.3495067
· OA: W4393689083
<strong>Overview</strong>

The purpose of this dataset is to train a classifier to detect "dusty" versus "not dusty" patches within browse-resolution HiRISE observations of the Martian surface. Here, "dusty" refers to images in which the view of the surface is heavily obscured by atmospheric dust. The dataset contains two sets of 20,000 image patches each, one drawn from EDR (full-resolution) and one from RDR ("browse"-resolution) non-map-projected ("nomap") HiRISE images, with balanced classes. Each set has been split into train (n = 10,000), validation (n = 5,000), and test (n = 5,000) subsets such that patches from the same HiRISE observation never appear in more than one subset. The labels may contain some noise, but a subset of the validation images has been manually vetted so that label noise rates can be estimated. The dataset creation process is described in more detail below.

<strong>Generating Candidate Images and Patches</strong>

To begin constructing the dataset, the paper "The origin, evolution, and trajectory of large dust storms on Mars during Mars years 24–30 (1999–2011)," by Wang and Richardson (2015), was used to compile a set of time ranges during which global or regional dust storms were known to be occurring on Mars. All HiRISE RDR nomap browse images acquired within these time ranges were then inspected manually to identify sets of images that were (1) almost entirely obscured by dust and (2) almost entirely clear of dust. Then, 10,000 patches were extracted from each of the two subsets of images to form the "dusty" and "not dusty" classes. The extracted patches are 100-by-100 pixels, which roughly corresponds to the width of one CCD channel within the browse image (the width of the raw EDR data products that are stitched together to form a full RDR image).
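The overview notes that the manually vetted subset of validation patches allows label noise rates to be estimated. A minimal sketch of that estimate is given here, under stated assumptions: the function name `label_noise_rates` is hypothetical, the label strings "dusty" and "not_dusty" are borrowed from the archive's directory names and may not match the values stored in the CSV files, and the "noise rate" is taken to be the fraction of vetted patches whose human annotation disagrees with the original label.

```python
from collections import Counter


def label_noise_rates(original_labels, vetted_labels):
    """Estimate apparent label noise from paired original/vetted labels.

    Returns a dict mapping each original class to the fraction of its
    vetted examples that the annotator relabeled, plus an "overall" rate.
    (Hypothetical helper; not part of the published dataset.)
    """
    if len(original_labels) != len(vetted_labels):
        raise ValueError("label lists must have the same length")
    if not original_labels:
        raise ValueError("at least one vetted example is required")

    # Number of vetted examples per original class.
    totals = Counter(original_labels)
    # Number of examples per original class where the annotator disagreed.
    flipped = Counter(
        orig for orig, vet in zip(original_labels, vetted_labels) if orig != vet
    )

    rates = {label: flipped[label] / totals[label] for label in totals}
    rates["overall"] = sum(flipped.values()) / len(original_labels)
    return rates


example = label_noise_rates(
    ["dusty", "dusty", "not_dusty", "not_dusty"],
    ["dusty", "not_dusty", "not_dusty", "not_dusty"],
)
```

Because only part of the validation set was vetted, rates computed this way are estimates for that sample rather than exact figures for the whole dataset.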
Extracting patches this way introduces a small amount of label noise: a patch from a mostly dusty image might happen to contain a clear view of the ground, and a patch from a mostly non-dusty image might contain some dust, or featureless regions of the surface that resemble dusty patches. A set of "vetting labels" is included, consisting of human annotations by the author for a subset of the validation patches. These labels can be used to estimate the apparent label noise in the dataset.

For the EDR dataset, patches for the "dusty" and "not dusty" classes are extracted from the EDRs corresponding to the same set of images used for the RDR patch dataset. EDRs are raw images from the instrument that have not been calibrated or stitched together. To provide some form of normalization, EDR patches are extracted only from the lower half of each EDR, with the upper half being used to perform a basic calibration of the lower half. Basic calibration is done by subtracting the sample (image column) averages from the upper half to remove "striping," then computing the 0.1<sup>th</sup> and 99.9<sup>th</sup> percentiles of the remaining values in the upper half and stretching the image patch to 8-bit integer values [0, 255] within that range. The calibration is meant to implement a process that could be performed onboard the spacecraft as the data are being acquired (hence the use of the top half of the image, which is acquired first, to calibrate the lower half, which is acquired later). Patches extracted from the full-resolution EDRs, which are 1024 pixels wide, are resized down to 100-by-100 pixels so that they roughly match the resolution of the patches from the RDR browse images.

<strong>Archive Contents</strong>

The compressed archive file contains two top-level directories with similar contents, "edr_nomap_full_resized" and "rdr_nomap_browse." The first directory contains the dataset constructed from EDR data and the second contains the dataset constructed from RDR data.
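The basic EDR calibration described above (column-average subtraction followed by a percentile stretch to 8 bits) can be sketched with NumPy. This is an illustration under assumptions, not the author's code: the function name `calibrate_patch` and its parameters are hypothetical, and the sketch assumes that the column averages computed from the upper half are also subtracted from the lower-half patch before stretching, which the description implies but does not state outright.

```python
import numpy as np


def calibrate_patch(upper_half, patch, low_pct=0.1, high_pct=99.9):
    """Destripe and stretch a lower-half EDR patch using the upper half.

    upper_half: 2-D array (rows x samples) from the top of the EDR.
    patch:      2-D array with the same number of samples (columns).
    Returns an 8-bit image with values in [0, 255].
    (Hypothetical helper; parameter names are illustrative.)
    """
    upper = np.asarray(upper_half, dtype=np.float64)
    patch = np.asarray(patch, dtype=np.float64)

    # Per-sample (image column) averages, computed from the upper half only.
    col_means = upper.mean(axis=0)

    # Residuals of the upper half after removing the column "striping".
    residual = upper - col_means
    lo, hi = np.percentile(residual, [low_pct, high_pct])

    # Assumption: the same column averages are removed from the patch.
    destriped = patch - col_means

    # Guard against a flat upper half, where the stretch range collapses.
    if hi <= lo:
        return np.zeros(patch.shape, dtype=np.uint8)

    # Linear stretch of [lo, hi] onto [0, 255], clipping values outside it.
    scaled = (destriped - lo) / (hi - lo) * 255.0
    return np.clip(np.round(scaled), 0, 255).astype(np.uint8)
```

Only statistics from the upper half are used, which matches the stated goal of a calibration that could run onboard as rows arrive. The subsequent resize of each 1024-pixel-wide patch to 100-by-100 pixels is not shown.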
Each of these directories contains "dusty" and "not_dusty" subdirectories holding the image patches of each class, along with two files, "manifest.csv" and "vetting_labels.csv." The vetting labels file contains a list of manually labeled examples, along with their original labels to make it easier to compute label noise rates. The "manifest.csv" file contains a list of every example, its label, and whether it belongs to the train, validation, or test set.

An example ID encodes information about where the patch was sampled from the original HiRISE image. As an example from the RDR dataset, the ID "003100_PSP_004440_2125_r4805_c512" can be broken into several parts:
"003100" is a unique numerical ID;
"PSP_004440_2125" is the HiRISE observation ID;
"r4805_c512" means the patch's upper left corner starts at the 4805<sup>th</sup> row and 512<sup>th</sup> column of the original observation.

For the EDR dataset, the ID "200000_PSP_004530_1030_RED7_1_r9153" is broken down as follows:
"200000" is a unique numerical ID;
"PSP_004530_1030" is the HiRISE observation ID;
"RED7" is the CCD ID;
"1" is the CCD channel (either 0 or 1);
"r9153" means that the patch is extracted starting at the 9153<sup>rd</sup> row (since all columns of the 1024-pixel EDR are used, no column is specified; it is implicitly always 0).

<strong>Original Data</strong>

The original HiRISE EDR and RDR data are available via the Planetary Data System (PDS), hosted at https://hirise-pds.lpl.arizona.edu/PDS/
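The two example ID layouts described under Archive Contents can be split into their parts with regular expressions. The sketch below is hypothetical: the function name `parse_example_id` and the returned field names are invented for illustration, and the patterns are generalized from the two example IDs given above (three capital letters, a six-digit number, and a four-digit number for the observation ID; capital letters followed by digits for the CCD ID), so they may need adjusting for IDs that deviate from those examples.

```python
import re

# RDR layout: <unique id>_<observation id>_r<row>_c<column>
RDR_ID = re.compile(
    r"^(?P<uid>\d+)_(?P<obs>[A-Z]{3}_\d{6}_\d{4})_r(?P<row>\d+)_c(?P<col>\d+)$"
)

# EDR layout: <unique id>_<observation id>_<CCD id>_<channel>_r<row>
EDR_ID = re.compile(
    r"^(?P<uid>\d+)_(?P<obs>[A-Z]{3}_\d{6}_\d{4})"
    r"_(?P<ccd>[A-Z]+\d+)_(?P<channel>[01])_r(?P<row>\d+)$"
)


def parse_example_id(example_id):
    """Split an RDR or EDR example ID into its documented components."""
    match = EDR_ID.match(example_id)
    if match:
        return {
            "product": "EDR",
            "uid": match["uid"],
            "observation_id": match["obs"],
            "ccd": match["ccd"],
            "channel": int(match["channel"]),
            "row": int(match["row"]),
            "col": 0,  # all 1024 columns are used, so the column is always 0
        }
    match = RDR_ID.match(example_id)
    if match:
        return {
            "product": "RDR",
            "uid": match["uid"],
            "observation_id": match["obs"],
            "row": int(match["row"]),
            "col": int(match["col"]),
        }
    raise ValueError(f"unrecognized example ID: {example_id!r}")


rdr = parse_example_id("003100_PSP_004440_2125_r4805_c512")
edr = parse_example_id("200000_PSP_004530_1030_RED7_1_r9153")
```

Grouping examples by the parsed `observation_id` is one way to confirm that the train, validation, and test subsets do not share observations.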