Thursday 7 July 2016

The Metadata Mess

This week has been a bit of a detour, and here’s why. For those of you not familiar with how scientific data and observations are stored: they generally live in FITS files, which hold the image/data plus a header containing metadata. This metadata is essentially a set of key/value pairs containing details such as the name of the instrument, telescope, wavelength of observation, condition and calibration of the sensors, details of the observation direction and more.
Note: in SunPy, metadata is generally stored in a case-insensitive dictionary formerly named MapMeta. For the purpose of this project that’s been renamed to MetaDict.
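To give a feel for what "case-insensitive dictionary" means in practice, here is a minimal sketch. It is not SunPy's actual MetaDict code; the class name CaseInsensitiveMeta and the header values are made up for illustration, and lower-casing the keys is just one simple way to get the behaviour.

```python
class CaseInsensitiveMeta(dict):
    """Hypothetical stand-in for a case-insensitive metadata dictionary."""

    def __init__(self, pairs=()):
        super().__init__()
        # Route every initial pair through __setitem__ so keys are normalised.
        for key, value in dict(pairs).items():
            self[key] = value

    def __setitem__(self, key, value):
        super().__setitem__(key.lower(), value)

    def __getitem__(self, key):
        return super().__getitem__(key.lower())

    def __contains__(self, key):
        return super().__contains__(key.lower())

    def get(self, key, default=None):
        return super().get(key.lower(), default)


# Illustrative header values only, not read from a real FITS file.
header = CaseInsensitiveMeta({"TELESCOP": "source_a", "INSTRUME": "sensor_1"})
telescope = header["telescop"]  # lookup works regardless of key case
```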

For the purpose of data manipulation, you often need to access some of this metadata and likewise it’s generally good practice to keep a record of it for your data for traceability, essentially allowing others to see the source of data used in your work.
Every file has metadata, and for a single time series built from a single file that’s easy to manage.
The issue comes when you start to combine time series together. In the easiest case you may simply want to combine multiple files from the same source instrument into a longer time series. For example, NOAA GOES files only give 24 hours of data, so if you want a time series across n days you get n metadata dictionaries. In this case most of the metadata should be the same, but some calibration data may not be.
Concatenating data from multiple sources is where the real problems lie, because then the metadata between columns (which are from different sources) may vary considerably.

So in the Refactor Project this whole problem was generally limited to the statement “consider how to deal with metadata”, which sounds nice and short, but in the end the solution has taken me the week to develop.

Welcome TimeSeriesMetaData Class

The solution was to develop a custom class that would allow us to store and access multiple metadata dictionaries in a logical fashion. The snappily named TimeSeriesMetaData class (came up with that all on my own) simply holds a list of 3-tuples, one for each of the metadata entries.
Each 3-tuple holds a time range (sunpy.time.timerange.TimeRange) for which the metadata is relevant, a list of column names (strings) that tell which columns in the data it is relevant for, and the metadata dictionary (MetaDict) itself.
This generally gives you enough data to uniquely identify the metadata source for any data cell within the TimeSeries.data DataFrame.
Note: the exception to this would be if you had interleaved rows in the same column that belong to two different metadata dictionaries, which is expected to be a rare case and bad practice.
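To make the layout concrete, here is a rough sketch of the list of 3-tuples and of how a time plus a column name picks out the matching metadata. Assumptions: a plain (start, end) tuple of datetimes stands in for TimeRange, a plain dict stands in for MetaDict, the ranges are treated as inclusive at both ends, and the helper metadata_for_cell and all values are invented for this example.

```python
from datetime import datetime

# Each entry is (time range, column names, metadata dictionary).
entries = [
    ((datetime(2016, 7, 1), datetime(2016, 7, 2)), ["xrsa", "xrsb"], {"telescop": "source_a"}),
    ((datetime(2016, 7, 2), datetime(2016, 7, 3)), ["xrsa", "xrsb"], {"telescop": "source_a"}),
    ((datetime(2016, 7, 1), datetime(2016, 7, 3)), ["flux"], {"telescop": "source_b"}),
]


def metadata_for_cell(entries, when, column):
    """Return every metadata dict covering the given time and column."""
    return [
        meta
        for (start, end), columns, meta in entries
        if start <= when <= end and column in columns
    ]


# One data cell: the "xrsa" column at midday on the first day.
matches = metadata_for_cell(entries, datetime(2016, 7, 1, 12), "xrsa")
```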

TimeRange Tweaks

The SunPy TimeRange object stores a start and end datetime, so logically two TimeRange objects are equivalent if they have the same values for these two dates. In implementing the TimeSeriesMetaData class it was necessary to check whether TimeRanges matched, but the Python == operator simply checked whether you were referencing the same object. So I needed to implement a new __eq__ method on the TimeRange class to add this functionality.
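The idea looks roughly like this. SimpleTimeRange is a hypothetical stand-in with just a start and an end, not the real sunpy TimeRange, so treat it as a sketch of value-based equality rather than the change that went into SunPy.

```python
from datetime import datetime


class SimpleTimeRange:
    """Hypothetical stand-in for a time range: a start and an end datetime."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def __eq__(self, other):
        # Equal when both endpoints match, not only when it is the same object.
        if not isinstance(other, SimpleTimeRange):
            return NotImplemented
        return self.start == other.start and self.end == other.end

    def __hash__(self):
        # Defining __eq__ removes the default hash in Python 3, so restore one
        # that is consistent with the equality above.
        return hash((self.start, self.end))


first = SimpleTimeRange(datetime(2016, 7, 1), datetime(2016, 7, 2))
second = SimpleTimeRange(datetime(2016, 7, 1), datetime(2016, 7, 2))
later = SimpleTimeRange(datetime(2016, 7, 2), datetime(2016, 7, 3))
```

Without the __eq__ method, first == second would fall back to an identity check and come out False even though the two ranges cover the same period.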

Methods of Meta(data)

The primary aim was to try and follow the standard methods used with a standard (single) dictionary metadata object, but in this case we had to accept that any simple query may come back with more than one result. For example, simply querying the telescope will likely come back with more than one result if you have multiple sources. So we need to return a collection object, generally a list.
Appending to the list is pretty simple: add a tuple with the desired TimeRange, column name list and MetaDict. I did make the append method insert in chronological order, though (sorted by the start of the time ranges). Concatenating two metadata objects simply invokes append with all the metadata entries.
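A minimal sketch of that chronological append and of concatenation built on top of it. The class name SketchMetaData, the method names and the (start, end) tuples standing in for TimeRange are all assumptions made for this example.

```python
from datetime import datetime


class SketchMetaData:
    """Hypothetical container holding (time range, columns, metadata) tuples."""

    def __init__(self):
        self.entries = []

    def append(self, timerange, columns, meta):
        # Walk past every entry that starts at or before the new one, so the
        # list stays sorted by the start of each time range.
        position = 0
        while (position < len(self.entries)
               and self.entries[position][0][0] <= timerange[0]):
            position += 1
        self.entries.insert(position, (timerange, list(columns), meta))

    def concatenate(self, other):
        # Concatenation is just an append of every entry from the other object.
        for timerange, columns, meta in other.entries:
            self.append(timerange, columns, meta)


meta = SketchMetaData()
meta.append((datetime(2016, 7, 3), datetime(2016, 7, 4)), ["xrsa"], {"day": 3})
meta.append((datetime(2016, 7, 1), datetime(2016, 7, 2)), ["xrsa"], {"day": 1})

other = SketchMetaData()
other.append((datetime(2016, 7, 2), datetime(2016, 7, 3)), ["xrsa"], {"day": 2})
meta.concatenate(other)
```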
To allow you to find the specific metadata you want, I added optional filters for the datetime and column name. With one of these you can often narrow down to one result, and with both filters in use you should be guaranteed at most one.
With these filters in hand I created a find method that returns the metadata entries matching the given filters. This was then used to build the get method, which returns values from the metadata dictionaries using the key (for example ‘telescop’ to get the satellite/telescope). Combined with the filters, this gives you a list of results matching your criteria.
Note: I tried to match the dictionary.get(key, default) API for this, with the filters passed as keyword arguments.
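Here is roughly how find and get fit together. This is a sketch written as plain functions over a list of 3-tuples; the keyword names time and colname, the (start, end) tuples standing in for TimeRange and the sample values are assumptions, not the real signatures.

```python
from datetime import datetime


def find(entries, time=None, colname=None):
    """Return the entries that pass whichever filters were supplied."""
    results = []
    for (start, end), columns, meta in entries:
        if time is not None and not (start <= time <= end):
            continue
        if colname is not None and colname not in columns:
            continue
        results.append(((start, end), columns, meta))
    return results


def get(entries, key, default=None, time=None, colname=None):
    """Mirror dict.get(key, default), but return one value per matching entry."""
    return [meta.get(key, default)
            for _, _, meta in find(entries, time=time, colname=colname)]


day1 = (datetime(2016, 7, 1), datetime(2016, 7, 2))
day2 = (datetime(2016, 7, 2), datetime(2016, 7, 3))
entries = [
    (day1, ["xrsa", "xrsb"], {"telescop": "source_a"}),
    (day2, ["xrsa", "xrsb"], {"telescop": "source_a"}),
    ((day1[0], day2[1]), ["flux"], {"telescop": "source_b"}),
]

everything = get(entries, "telescop")
single = get(entries, "telescop", time=datetime(2016, 7, 1, 12), colname="xrsa")
```

With no filters every entry contributes a value, which is why the return type has to be a list; adding both filters narrows it to the one entry covering that cell.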
A similar methodology was used to implement an update method that allows you to add the key value pairs from a dictionary to the existing metadata dictionary, though in this case we may be updating multiple dictionaries at a time, depending on the use of filters.
Note: adding to the metadata is definitely a valuable tool, as it allows you to add notes. Editing the original metadata is generally not needed, however, so I added an overwrite kwarg that defaults to False to protect the existing values if the user only wants to add new ones.
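A sketch of how such an update could behave, again as a plain function over the list of 3-tuples. The filter names time and colname, the exact overwrite semantics (existing keys are left alone unless overwrite is True) and the sample values are my assumptions for illustration.

```python
from datetime import datetime


def update(entries, new_values, time=None, colname=None, overwrite=False):
    """Add key/value pairs to every metadata dict that passes the filters."""
    for (start, end), columns, meta in entries:
        if time is not None and not (start <= time <= end):
            continue
        if colname is not None and colname not in columns:
            continue
        for key, value in new_values.items():
            # Existing keys are protected unless overwrite is requested.
            if overwrite or key not in meta:
                meta[key] = value


day1 = (datetime(2016, 7, 1), datetime(2016, 7, 2))
entries = [
    (day1, ["xrsa"], {"telescop": "source_a"}),
    (day1, ["flux"], {"telescop": "source_b"}),
]

# Adds the note to the "xrsa" entry only; the existing telescop value is kept.
update(entries, {"note": "checked", "telescop": "changed"}, colname="xrsa")
```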

Sanitising the Metadata

Two more abilities were needed so that the metadata can be cleaned up when you truncate a TimeSeries or extract part of one: removing columns (simply removing all references to a name from the column name lists) and truncating the time ranges (removing any metadata outside of a given range).
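Both clean-up operations can be sketched in a few lines. The function names are hypothetical, (start, end) tuples stand in for TimeRange, and two behaviours here are my own assumptions rather than anything stated above: entries left with no columns are dropped, and entries that partly overlap the new range are clipped to it.

```python
from datetime import datetime


def remove_column(entries, colname):
    """Drop a column name everywhere; discard entries left with no columns."""
    cleaned = []
    for timerange, columns, meta in entries:
        remaining = [name for name in columns if name != colname]
        if remaining:
            cleaned.append((timerange, remaining, meta))
    return cleaned


def truncate(entries, new_start, new_end):
    """Keep only metadata overlapping the new range, clipped to that range."""
    kept = []
    for (start, end), columns, meta in entries:
        if end < new_start or start > new_end:
            continue  # entirely outside the requested range
        kept.append(((max(start, new_start), min(end, new_end)), columns, meta))
    return kept


entries = [
    ((datetime(2016, 7, 1), datetime(2016, 7, 2)), ["xrsa", "xrsb"], {"telescop": "source_a"}),
    ((datetime(2016, 7, 2), datetime(2016, 7, 3)), ["xrsa", "xrsb"], {"telescop": "source_a"}),
    ((datetime(2016, 7, 1), datetime(2016, 7, 3)), ["flux"], {"telescop": "source_b"}),
]

without_flux = remove_column(entries, "flux")
afternoon = truncate(entries, datetime(2016, 7, 2, 6), datetime(2016, 7, 2, 18))
```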

And The Result

So from that small note, “consider how to deal with metadata”, I have had a pretty busy week. On the plus side, the TimeSeriesMetaData object is now functional and handles the general issues we had with combined metadata from multiple sources in an easy-to-use but powerful way.
When I showed it to my supervisor in a meeting, I think he was suitably happy and impressed with the result.
So now I can get back on to the time series and making tests!
