Thursday 7 July 2016

The Metadata Mess

This week has been a bit of a detour, and here's why. For those of you not familiar with the way scientific data and observations are stored: they generally use FITS files, with the image/data accompanied by a header containing metadata. This metadata is essentially a set of key/value pairs holding details such as the name of the instrument, the telescope, the wavelength of observation, the condition and calibration of the sensors, details of the observation direction and more.
Note: in SunPy metadata is generally stored in a form of case-insensitive dictionary formerly named MapMeta. For the purposes of this project that's been renamed to MetaDict.
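To make the case-insensitivity concrete, here's a minimal stdlib-only sketch of the idea; the real MetaDict in SunPy is rather more complete, this just shows the behaviour:

```python
class CaseInsensitiveDict(dict):
    """Minimal sketch of a case-insensitive metadata dictionary.

    SunPy's real MetaDict is more complete; the idea shown here is
    that keys are normalised to lower case on the way in and out, so
    FITS-style upper-case header keys still match.
    """

    def __init__(self, *args, **kwargs):
        super().__init__()
        for key, value in dict(*args, **kwargs).items():
            self[key] = value

    def __setitem__(self, key, value):
        super().__setitem__(key.lower(), value)

    def __getitem__(self, key):
        return super().__getitem__(key.lower())


meta = CaseInsensitiveDict({'TELESCOP': 'GOES'})
print(meta['telescop'])  # the upper-case FITS key still matches
```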

For the purposes of data manipulation you often need to access some of this metadata, and it's generally good practice to keep a record of it for traceability, essentially allowing others to see the source of the data used in your work.
Every file has metadata, and for a single time series built from a single file that's easy to manage.
The issue comes when you start to combine time series together. In the easiest case you may want to simply combine multiple files from the same source instrument into a longer time series; for example, NOAA GOES files only give 24 hours of data, so if you want a time series across n days you get n metadata dictionaries. In this case most of the metadata should be the same, but some calibration data may not be.
Concatenating data from multiple sources is where the real problems lie, because then the metadata between columns (which are from different sources) may vary considerably.

So in the Refactor Project this whole problem was summed up in the statement “consider how to deal with metadata”, which sounds nice and short, but in the end the solution has taken me the whole week to develop.

Welcome TimeSeriesMetaData Class

The solution was to develop a custom class that would allow us to store and access multiple metadata dictionaries in a logical fashion. The snappily named TimeSeriesMetaData class (I came up with that all on my own) simply holds a list of 3-tuples, one for each metadata entry.
Each 3-tuple holds a time range (sunpy.time.timerange.TimeRange) for which the metadata is relevant, a list of column names (strings) telling you which columns in the data it applies to, and the MetaDict itself.
This generally gives you enough information to uniquely identify the metadata source for any data cell within the TimeSeries.data DataFrame.
Note: the exception would be if you had interleaved rows matching two different metadata entries in the same column, which is expected to be a rare case and bad practice.
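As a rough illustration, the internal list might look something like the sketch below, with plain (start, end) datetime tuples and ordinary dicts standing in for the real TimeRange and MetaDict objects; the GOES-style column names are made up for illustration:

```python
from datetime import datetime

# Sketch of the internal structure: a list of 3-tuples, each pairing
# a time range with the columns it describes and the metadata
# dictionary itself.  The real class uses sunpy.time.TimeRange and
# MetaDict; plain tuples and dicts stand in for them here.
metadata = [
    ((datetime(2012, 6, 1), datetime(2012, 6, 2)),   # time range covered
     ['xrsa', 'xrsb'],                               # columns it describes
     {'telescop': 'GOES', 'instrume': 'X-ray Detector'}),
    ((datetime(2012, 6, 2), datetime(2012, 6, 3)),   # next day's file
     ['xrsa', 'xrsb'],
     {'telescop': 'GOES', 'instrume': 'X-ray Detector'}),
]
```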

TimeRange Tweaks

The SunPy TimeRange object stores a start and an end datetime, so logically two TimeRange objects are equivalent if they hold the same values for these two dates. In implementing the TimeSeriesMetaData class it was necessary to check whether TimeRanges matched, but the Python == operator simply checked whether you were referencing the same object. So I needed to implement a new __eq__ method on the TimeRange class to add this functionality.
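The change amounts to something like the following sketch; the real sunpy.time.TimeRange has far more functionality, but the value-based equality is the interesting part:

```python
from datetime import datetime

class TimeRange:
    """Stripped-down sketch of sunpy.time.TimeRange, just enough to
    show the value-based equality being added."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def __eq__(self, other):
        # Two ranges are equal when both endpoints match, regardless
        # of whether they are the same object in memory.
        if isinstance(other, TimeRange):
            return self.start == other.start and self.end == other.end
        return NotImplemented


a = TimeRange(datetime(2016, 7, 1), datetime(2016, 7, 2))
b = TimeRange(datetime(2016, 7, 1), datetime(2016, 7, 2))
print(a == b)  # True: equal values, even though a is not b
```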

Methods of Meta(data)

The primary aim was to follow the standard methods used with a single (one-dictionary) metadata object, but in this case we had to accept that any simple query may come back with more than one result. For example, querying the telescope will likely return more than one value if you have multiple sources. So we need to return a collection object, generally a list.
Appending to the list is pretty simple: add a tuple with the desired TimeRange, column name list and MetaDict. Though I did make the append method insert in chronological order (by the start of the time ranges). Concatenating two metadata objects simply invokes append for all metadata entries.
To allow you to find the specific metadata you want I added optional filters for the datetime and column name. With one of these you can often narrow down to one result, and with both filters in use you should be guaranteed only one.
With these filters in hand I created a find method to locate the metadata entries matching the given filters. This was then used to build the get method, which returns values from the metadata dictionaries using a key (for example 'telescop' to get the satellite/telescope); combined with the filters, this gives you a list of results matching your criteria.
Note: I tried to match the dictionary.get(key, default) API for this, with the filters accessed using keyword arguments.
A similar methodology was used to implement an update method that adds the key/value pairs from a given dictionary to the existing metadata dictionaries, though in this case we may be updating multiple dictionaries at once, depending on the filters used.
Note: adding to the metadata is definitely a valuable tool, as it allows you to add notes. However, editing the original metadata is generally not needed, so I added an overwrite kwarg that defaults to False to protect the metadata when the user only wants to append.
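Pulling those pieces together, here's a stdlib-only sketch of how the find and get filtering could work; the entries, column names and standalone functions are illustrative stand-ins for the real class's methods, not the actual implementation:

```python
from datetime import datetime

# Two metadata entries from different sources, sharing a time range.
# Plain (start, end) tuples and dicts stand in for TimeRange/MetaDict.
entries = [
    ((datetime(2012, 6, 1), datetime(2012, 6, 2)), ['xrsa'], {'telescop': 'GOES'}),
    ((datetime(2012, 6, 1), datetime(2012, 6, 2)), ['cts'], {'telescop': 'RHESSI'}),
]

def find(entries, time=None, colname=None):
    """Return the entries matching the optional time/column filters."""
    results = []
    for timerange, cols, meta in entries:
        if time is not None and not (timerange[0] <= time <= timerange[1]):
            continue
        if colname is not None and colname not in cols:
            continue
        results.append((timerange, cols, meta))
    return results

def get(entries, key, default=None, time=None, colname=None):
    """dict.get-style lookup that may return several values,
    one per matching metadata entry."""
    return [meta.get(key, default)
            for _, _, meta in find(entries, time=time, colname=colname)]


print(get(entries, 'telescop'))                  # both sources match
print(get(entries, 'telescop', colname='cts'))   # narrowed to one
```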

Sanitising the Metadata

Adding the ability to remove columns (simply removing all references to a name from the column name lists) and the ability to truncate the time ranges (removing any metadata outside of a given range) was necessary so we can clean up the metadata in the event you truncate or extract a TimeSeries.
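The column-removal side of this clean-up can be sketched like so, again with plain tuples and dicts standing in for the real TimeRange and MetaDict objects:

```python
def remove_colname(entries, colname):
    """Sketch of column removal from the metadata list: strip every
    reference to the name from the column lists, and drop any entry
    left describing no columns at all."""
    cleaned = []
    for timerange, cols, meta in entries:
        kept = [c for c in cols if c != colname]
        if kept:  # entries with no remaining columns are discarded
            cleaned.append((timerange, kept, meta))
    return cleaned


entries = [
    ('range-1', ['xrsa', 'xrsb'], {'telescop': 'GOES'}),
    ('range-2', ['xrsb'], {'telescop': 'GOES'}),
]
print(remove_colname(entries, 'xrsb'))
```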

And The Result

So from that small note, “consider how to deal with metadata”, I have had a pretty busy week. On the plus side, the TimeSeriesMetaData object is now functional and solves all the general issues we had with combining metadata from multiple sources in a rather easy-to-use but powerful way.
Showing it in a meeting with my supervisor I think he was suitably happy and impressed with the result.
So now I can get back on to the time series and making tests!

Saturday 25 June 2016

Adding AstroPy Quantities

Welcome To Quantities


After getting the TimeSeries factory working and proving the TimeSeries functionality works with real data it was time to implement another of the key features missing from the old LightCurve class.


--- Cue Drumroll ---

AstroPy Quantity support!

With the rest of SunPy moving to support astropy Quantities, which allow easy conversion of data from one unit to another, the LightCurve was starting to look pretty dated.
In the TimeSeries class I included space for a units dictionary. The basic design uses it to store a series of key/value pairs, where the key matches the column name/title and the value is the associated unit. This gives us rudimentary unit functionality without modifying the Pandas DataFrame class (which would be a very complex task), and we can easily integrate it into other functions to allow easier access.
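As a sketch of that design, with made-up column names, and plain unit strings standing in for the astropy Unit objects the real dictionary holds:

```python
# Sketch of the units-dictionary design: one entry per data column,
# keyed by column name.  The real implementation stores
# astropy.units.Unit objects as the values; plain strings stand in
# for them here, and the GOES-style column names are illustrative.
units = {
    'xrsa': 'W / m2',
    'xrsb': 'W / m2',
}

def unit_of(colname):
    """Look up the unit recorded for a column, if any."""
    return units.get(colname)

print(unit_of('xrsa'))
```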


To implement this there were a number of changes to be made: firstly I needed to manually define the units for each column in an instrument's source files; next I needed to add the ability for the parse_file methods to send the units dictionary out to the factory, and for the constructors to accept it.
This gave me working units for the TimeSeries instrument data.
  

Manually Creating TimeSeries Objects

Something that hadn't been fully implemented was the ability to manually create a TimeSeries from some source data. To fix this I needed to change the code I permanently borrowed from the Map Factory to adjust for the differences of a time series.
Firstly, a time series is more likely to take a pandas DataFrame object than an array for the data. In fact, after a discussion with my supervisors it was decided that it'd be sensible to accept the data as a NumPy array, a DataFrame or an astropy Table. This is pretty easy, as pandas includes code for ingesting arrays and the Table class has a to_pandas method.
The units will also often be included, ideally as an optional argument. This meant I needed a way to check that a given dictionary is a valid units dictionary, so I created a _validate_units method in the factory.
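A sketch of what such a validation check could look like; the real _validate_units presumably checks the values are astropy unit objects, which any non-None value stands in for here:

```python
def validate_units(units):
    """Sketch of a _validate_units-style check: the argument must be
    a dictionary mapping column-name strings to units.  The real
    factory method would check for astropy unit objects; here we only
    require that every key is a string with some value present."""
    if not isinstance(units, dict):
        return False
    for key, value in units.items():
        if not isinstance(key, str) or value is None:
            return False
    return True


print(validate_units({'a': 'ct', 'b': 'ct'}))  # True
print(validate_units([('a', 'ct')]))           # False: not a dict
```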
So with these changes made we have the ability to make TimeSeries using any arbitrary data:
  >>> from astropy.table import Table
  >>> import sunpy.timeseries

  >>> a = [1, 4, 5]
  >>> b = [2.0, 5.0, 8.2]
  >>> c = ['x', 'y', 'z']
  >>> t = Table([a, b, c], names=('a', 'b', 'c'), meta={'name': 'first table'})
  >>> df = t.to_pandas()
  >>> ts_from_table = sunpy.timeseries.TimeSeries(t, {})
  >>> ts_from_df = sunpy.timeseries.TimeSeries(df, {})

In this case units are extracted from the astropy Table, but otherwise you can specify them manually using the units kwarg:
  >>> from collections import OrderedDict
  >>> import astropy.units as u

  >>> units = OrderedDict([('a', u.Unit("ct")),
  ...                      ('b', u.Unit("ct")),
  ...                      ('c', u.Unit("ct"))])
  >>> ts_from_df = sunpy.timeseries.TimeSeries(df, {}, units)

We can then peek at the result:
  >>> ts_from_table.peek()


The quantity(col_name) Method

With the units data stored in the TimeSeries, we could then implement a quantity method, similar to the property that allows you to extract a column of values from an astropy Table as an astropy Quantity object.
For a TimeSeries this can be done as:
  >>> qua = ts_from_table.quantity('b')
  >>> print(qua)
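Under the hood, such a method could simply pair a column's values with its stored unit. Here's a stdlib-only sketch, where plain lists and strings stand in for the pandas column and the astropy Unit; the real method would return values multiplied by the unit as an astropy Quantity:

```python
# Sketch of how a quantity(col_name) method could be built on the
# units dictionary described above.
data = {'b': [2.0, 5.0, 8.2]}   # column values, as the DataFrame holds them
units = {'b': 'ct'}             # unit string standing in for u.Unit("ct")

def quantity(colname):
    """Return the column's values together with its unit."""
    return data[colname], units[colname]


values, unit = quantity('b')
print(values, unit)
```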

So that about wraps up the current work. Fingers crossed I’ll have a gallery tutorial for all to see in the meeting on Wednesday. :)

Friday 24 June 2016

Getting The Factory Working

After a less than productive week, with two drives failing on me and so a lot of Windows reinstalling, I have finally started making some real progress.

Welcome TimeSeries Factory

With the factory class mostly there but not actually working, I needed to spend some time debugging.
The issue is that there is a lot of rather complex code in the factory, not least because I stole… erm… borrowed code from the Map Factory, and it is a rather complex process to follow.

Essentially the factory takes a series of arguments such as filepaths, data/header pairs, folders and/or the kitchen sink, then returns either a time series or a list of time series.
To do this it has to evaluate each input, then call the relevant source class constructor.
In contrast to (image) maps, lightcurve and time series data are often stored in a diverse array of filetypes, some of which don't make it possible to auto-detect the instrument source of the file.
So in this case we had to allow a manually declared source and change the Map Factory code a bit so it doesn't try to read files it can't.
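The input-evaluation step can be sketched like this; the category names and the classify_input function are illustrative, not the factory's real internals:

```python
import os

def classify_input(arg):
    """Sketch of the factory's first step: decide what kind of input
    each argument is before handing it on to a parser or a source
    class constructor."""
    if isinstance(arg, str) and os.path.isdir(arg):
        return 'directory'          # expand to the files inside it
    if isinstance(arg, str):
        return 'filepath'           # try the registered file readers
    if isinstance(arg, tuple) and len(arg) == 2:
        return 'data/header pair'   # already-parsed data plus metadata
    return 'unknown'


print(classify_input('goes.fits'))
print(classify_input((object(), {'telescop': 'GOES'})))
```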

With this done I still had issues and so I got to simple debugging.
So basically throwing in print statements at just about every other line of code to figure out what each variable holds and where my code stops.
It was messy, but it worked and hopefully by the time you read this the code will be clean and print statement free!


Making Source Classes Work

So after the factory was running correctly and able to select the appropriate source class, I then needed to tweak the instrument source class files so that they read in the files correctly.
Again a few bugs emerged, but primarily this was pretty simple, as most of the work was already done in the original LightCurve class.

The result is that now we have the ability to generate a TimeSeries from any of the listed sources in a tiny amount of code, for example, using the sample data:
  >>> ts_goes = sunpy.timeseries.TimeSeries(sunpy.data.sample.GOES_LIGHTCURVE, source='GOES')

And because I had the LightCurve code, I could also use the old peek() method with no additional changes:
  >>> ts_goes.peek()


Meeting


So with this working I showed my advisers and we went through some more of the general TimeSeries functionality, where I showed them how it can concatenate, truncate and resample data just like this:
  >>> ts_goes_trunc = ts_goes.truncate('2012-06-01 05:00','2012-06-01 06:30')
  >>> ts_goes_trunc.peek()

  >>> downsampled = ts_goes_trunc.resample('10T', 'mean')
  >>> downsampled.peek()

  >>> upsampled = downsampled.resample('1T', 'ffill')
  >>> upsampled.peek()

And with (almost) no bugs along the way I’m left feeling pretty happy that the project is moving in the right direction.

Friday 10 June 2016

First Week – First Meeting

Kicking the project off to a controversial start: in a meeting with my supervisors Stuart and Danny we decided to shock all by throwing away the idea of refactoring the LightCurve and simply creating a totally new datatype, called the TimeSeries.
OK, to clarify, we actually discussed this at the SunPy developers meeting a while back, where it was pretty unanimously agreed that using the term lightcurve as the name for all possible time series data (including non-light data) was confusing. Whether a LightCurve class will be defined as a child of the TimeSeries class is still an open debate, primarily because of its SEO and documentation advantages (lightcurve is a well-known scientific term), but that can easily be implemented at a later stage.

Further to this, the meeting covered a lot of details of the project, potential directions for it to head in and the general areas of expertise of my supervisors. Though admittedly nothing much more came of it, as it was mostly about orientation.

Meet The GenericTimeSeries

But after the meeting the fun (or maybe just the) coding started.
I started by creating a basic class to hold 3 items:

  1. Data: as a Pandas DataFrame,
  2. Meta: as an ordered dictionary,
  3. Units: as another dictionary storing the AstroPy unit related to each column, with keys matching the corresponding column name.

I then went to work replicating the main functionality of the ideal Lightc… erm… TimeSeries API as specified here:
    github.com/sunpy/sunpy/issues/1520
using direct access to the data DataFrame.

Once this was complete, and I had some basic tests for the general functionality, I tried to incorporate the functionality within methods for the TimeSeries class, matching the old LightCurve class interface where possible.
Note: we expect to radically alter the interface away from that of the old LightCurve, but it’s a good place to start.

Generally the coding was pretty easy; a little bit of thought made truncation work with SunPy time ranges, Pandas date strings or basic integer slicing alike. I also spent some time on the basic object parameters; for example, the date of the TimeSeries now defaults to the date of the first entry.
So now we have a new, if rather rudimentary, class to start building upon.
Note: the GenericTimeSeries name is used to keep in line with SunPy's GenericMap; this allows us to call the factory using TimeSeries(args) and leads to a simple user interface.
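The truncation flexibility mentioned above can be sketched as a small dispatch on the argument type; the function name and the fixed date format here are assumptions for illustration, since the real method works on the underlying DataFrame index:

```python
from datetime import datetime

def normalise_truncate_args(arg):
    """Sketch of accepting several truncation inputs: a plain integer
    slice, a pair of date strings, or a (start, end) pair of datetimes
    standing in for a SunPy TimeRange."""
    if isinstance(arg, slice):
        return ('integer slice', arg.start, arg.stop)
    start, end = arg
    if isinstance(start, str):
        # Date strings get parsed; pandas accepts many formats, a
        # single fixed one is assumed here to keep the sketch simple.
        start = datetime.strptime(start, '%Y-%m-%d %H:%M')
        end = datetime.strptime(end, '%Y-%m-%d %H:%M')
    return ('time range', start, end)


print(normalise_truncate_args(slice(0, 100)))
print(normalise_truncate_args(('2012-06-01 05:00', '2012-06-01 06:30')))
```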

Next step, making some sub-classes for the instruments.

Preparations for the LightCurve Refactor Project

Before starting the LightCurve Refactor project I did a bit of research on the problem, both to make my application stand out and to give me a bit of a head start.
It seems only sensible that I present what I found.

Issues To Fix

The old LightCurve class suffers from five major issues:

Redundant Code

Being made before UniDown, and with a wide variety of file formats for multiple instruments added over time, there is a lot of now-redundant code to be removed.

Inconsistent Interface

Each instrument gets its data from different sources, often with different filetypes that aren't always distinguishable. Likewise, the query requirements for some of these data resources often require different parameters. This has led to a rather inconsistent interface, where you have to manually call a different constructor for each instrument, e.g.:
  >>> lc = LYRALightCurve.create(LYRA_filepath)
  >>> lc = RHESSISummaryLightCurve.create(RHESSI_filepath)
Whereas all maps are created using the same Map factory:
  >>> map = Map(any_filepath)
This approach should be replicated in the new datatype.

Inability to Concatenate

Often users need to evaluate data with a time range spanning multiple files; the LightCurve class can't handle this.

Astropy Unit Incompatible

While the rest of the SunPy library has moved to support Astropy Units, which enable easy comparison of data with different units, the LightCurve class doesn't support them at all.

Lack of Functionality

At the moment, despite the underlying Pandas DataFrame supporting robust data selection and manipulation functionality (such as filtering by time, resampling and so on), the LightCurve class exposes none of it.


The Debate - Implementing on AstroPy Tables

The LightCurve class is based on the Python Pandas DataFrame class, which is designed to store, manipulate and visualise time series data. But this is both another dependency for SunPy (Pandas isn't used elsewhere) and, not being specifically science-focused, very unlikely to ever support anything similar to AstroPy Units.
The suggestion was made to use AstroPy Tables for storing the data. This seemed ideal, as they support Units natively and can support the more advanced AstroPy Time datastructures and WCS coordinates.
But upon investigation there was a major flaw with AstroPy Tables: essentially, to support AstroPy Time you need a Table with a mixin column, and since Time objects are immutable (simply an implementation decision) this stops you from being able to re-order or append to a column containing Times. This basically means that re-arranging or adding to a table requires completely recreating it; at best memory and processor intensive, and at worst simply impractical.
There is an outstanding request for a TimeSeries datatype based on AstroPy Tables:
    github.com/Cadair/astropy-APEs/blob/master/APE9.rst
But for now the decision was made to keep the implementation using Pandas.

This gets you up to speed on the problem and the early work I did.

Thursday 26 May 2016

Refactoring The SunPy Lightcurve Datatype

I'm pleased to announce that the ESA Summer of Code in Space (SoCiS) program will be supporting me in refactoring the lightcurve class in SunPy.

What's SunPy

In short, SunPy is an open-source Python library designed for solar physics. Developed by a team of researchers from across the globe, it aims to make anyone able to analyse solar data and is a valuable resource for the solar physics community.
I've been working with SunPy since using it for my dissertation and have tried to contribute where I can ever since.


What's The Lightcurve

In astrophysics a lightcurve is a measure of light flux/intensity over time. In SunPy it started off as this, but has essentially been used as a generic time series data class. With SunPy development concentrating on the other two major datatypes (Maps and Spectra), the Lightcurve class has been left mostly untouched and is in general need of redevelopment.


The Refactor

My project will be to refactor the old lightcurve code so that it more closely matches the rest of the SunPy library. This will include removing redundant code, creating a whole new class hierarchy (starting with time series as a parent for the lightcurve subclass) and updating the documentation.
I'll be starting the work soon and will be updating the blog with my progress.