Saturday, 25 June 2016

Adding AstroPy Quantities

Welcome To Quantities


After getting the TimeSeries factory working and proving the TimeSeries functionality works with real data it was time to implement another of the key features missing from the old LightCurve class.


--- Queue Drumroll---

AstroPy Quantity support!

With the rest of SunPy moving to support astropy Quantities, that allows for the easy conversion of data from one unit to another the LightCurve was starting to look pretty dated.
In the TimeSeries class I included space for a units dictionary, the basic design would use it to store a series of key/value pairs, where the key matches the column name/title and the value is the associated unit. This would give us rudimentary unit functionality without modifying the Pandas DataFrame class (which would be a very complex task) and we can easily integrate this into other functions to allow more easy access.


To implement this there were a number of changes that needed to be made, firstly I needed to manually define the units for each column in an instruments source files, next I needed to add the ability for the parse_file methods to send the units dictionary out to the factory and for the constructors to include it.
This gave me working units for the TimeSeries instrument data.
  

Manually Creating TimeSeries Objects

Something that hadn’t been fully implemented was the ability to manually create a TimeSeries given some source data. To fix this I needed to change the code I permanently burrowed from the Map Factory to adjust for the differences of a time series.
Firstly the time series is more likely to take a pandas DataFrame object then an array for the data. In-fact, after a discussion with my supervisors it was decided that it’d be sensible to accept the data as an Numpy Array, DataFrame or astropy Table. This is pretty easy as pandas includes code for using arrays and the Table class has a to_pandas method.
The units will also often be included, ideally as an optional argument. This means that I needed a way to check a dictionary is a valid units dictionary and so I created a _validate_units method in the factory.
So with these changes made we have the ability to make TimeSeries using any arbitrary data:
  >>> a = [1, 4, 5]
  >>> b = [2.0, 5.0, 8.2]
  >>> c = ['x', 'y', 'z']
  >>> t = Table([a, b, c], names=('a', 'b', 'c'), meta={'name': 'first table'})
  >>> df = t.to_pandas()
  >>> ts_from_table = sunpy.timeseries.TimeSeries(t,{})
  >>> ts_from_df = sunpy.timeseries.TimeSeries(df,{})

In this case units are extracted from the astropy Table, but otherwise you can manually specify using the units kwarg:
  >>> units = OrderedDict([('a', u.Unit("ct")),
                   ('b', u.Unit("ct")),
                   ('c', u.Unit("ct"))])
  >>> ts_from_df = sunpy.timeseries.TimeSeries(df,{} , units)

We can peek this:
  >>> ts_from_table.peek()


The quantity(col_name) Method

With this units data stored in the TimeSeries, we could then implement a method like the quantity property that allows you to extract a column of values from an astropy Table as an astropy Quantity object.
For a TimeSeries this can be done as:
ts_from_table = ts_eve.quantity(‘b')
print(qua)

So that about wraps up the current work. Fingers crossed I’ll have a gallery tutorial for all to see in the meeting on Wednesday. :)

Friday, 24 June 2016

Getting The Factory Working

After a less then productive week, with 2 drives failing on me and so a lot of windows reinstalling, I have finally started getting some real progress.

Welcome TimeSeries Factory

With the factory class mostly there but not actually working, I needed to spend some time debugging.
The issue there is that there is a lot of rather complex code in the factory, not least because I stole… erm… Burrowed code form the Map Factory and it is a rather complex process to follow.

Essentially the factory takes a series of arguments like filepaths, data/header pair, folders and/or the kitchen sink; then returning either a time series or a list of time series’.
To do this it has to evaluate each input, then call the relevant source class constructor.
In contrast to (image) maps, lightcurves and time series data are often stored in a diverse array of filetypes, some which don’t make it possible to auto detect the instrument source of the file.
So in this case we had to manually declare a source and change the Map Factory code a bit so it doesn’t try to read files it can’t.

With this done I still had issues and so I got to simple debugging.
So basically throwing in print statements at just about every other line of code to figure out what each variable holds and where my code stops.
It was messy, but it worked and hopefully by the time you read this the code will be clean and print statement free!


Making Source Classes Work

So after the factory was running correctly and able to select the appropriate source class I then needed to tweak the instrument source class files so that the read in the files correctly.
Again a few bugs emerged, but primarily the this was pretty simple as most of the work was already done with the original LightCurve class.

The result is that now we have the ability to generate a TimeSeries from any of the listed sources in a tiny amount of code, for example, using the sample data:
  >>> ts_goes = sunpy.timeseries.TimeSeries(sunpy.data.sample.GOES_LIGHTCURVE, source='GOES')

And because I had the LightCurve code, I could also use the old peek() method with no additional changes:
  >>> ts_goes.peek()


Meeting


So with this working I showed my advisers and we went through some more of the general TimeSeries functionality, where I showed them how it can concatenate, truncate and resample data just like this:
  >>> ts_goes_trunc = ts_goes.truncate('2012-06-01 05:00','2012-06-01 06:30')
  >>> ts_goes_trunc.peek()

  >>> downsampled = ts_goes_trunc.resample('10T', 'mean')
  >>> downsampled.peek()

  >>> upsampled = downsampled.resample('1T', 'ffill')
  >>> upsampled.peek()

And with (almost) no bugs along the way I’m left feeling pretty happy that the project is moving in the right direction.

Friday, 10 June 2016

First Week – First Meeting

Kicking the project off to a controversial start, in a meeting with my supervisors Stuart and Danny we decided to shock all by throwing away the idea or refactoring the LightCurve and simply creating a totally new datatype, called the TimeSeries.
OK, to clarify, we actually discussed this at the SunPy developers meeting a while back, it was pretty unanimously agreed that using the term lightcurve as the name for all possible time series data (including non-light data) was confusing. The actual decision of if a LightCurve class will be defined as a child of the TimeSeries class is an open debate, primarily for it’s SEO and documentation advantages (lightcurve is a well known scientific term), but that can be easily implemented at a later stage.

Further to this, the meeting discussed a lot of details of the project, potential directions for it to head and the general areas of expertise of my supervisors. Though admittedly Nothing much more special came from it as it was more about orientation.

Meet The GenericTimeSeries

But after the meeting the fun (or maybe just the) coding started.
I started with creating a basic class to hold 3 items:

  1. Data: as a Panda DataFrame,
  2. Meta: as an ordered dictionary,
  3. Units: as another dictionary to store the AtsroPy Quantity related to each column with index keys matching the corresponding column name.

I then went to work replicating the main functionality of the ideal Lightc… erm… TimeSeries API as specified here:
    github.com/sunpy/sunpy/issues/1520
Using direct access to the data DataFrame.

Once this was complete, and I had some basic tests for the general functionality, I tried to incorporate the functionality within methods for the TimeSeries class, matching the old LightCurve class interface where possible.
Note: we expect to radically alter the interface away from that of the old LightCurve, but it’s a good place to start.

Generally the coding was pretty easy, a little bit of thought made implementing truncation using either SunPy time ranges, Pandas date strings or basic integer slicing all work together. I also spent some time on the basic object parameters, now the date of the TimeSeries defaults the date of the first entry, for example.
So now we have a new, if rather rudimentary, class to start building upon.
Note: the use of the GenericTimeSeries name is to keep in-line with the SunPy GenericMap, this allows us to call the factory using TimeSeries(args) and leads to a simple user interface.

Next step, making some sub-classes for the instruments.

Preporations for the Lightcurve Refactor Project

Before starting the LightCurve Refactor project I did a bit of research on the problem, both to make my application stand out and to give me a bit of a head start.
It seems only sensible that I present what I found.

Issues To Fix

The old LightCurve class suffers from 4 major functionality issues:

Redundant Code

Being made before UniDown and with a wide variety file formats for multiple instruments being added over time, there is a lot of now redundant code that is to be removed.

Inconsistent Interface

Each instrument gets its data from different sources and often with different filetypes that aren’t always distinguishable. Likewise the query requirements for some of these data resources often require different parameters. This has led to a rather inconsistent interface, where you have to manually call a different constructor for each instrument, i.e:
  >>> lc = LYRALightCurve.create(LYRA_filepath)
  >>> lc = RHESSISummaryLightCurve.create(RHESSI_filepath)
Where-as all maps are all created using the same map factory:
  >>> map = Map(any_filepath)
This approach should be replicated in the new datatype.

Inability to Concatenate

Often users need to evaluate data with a time range spanning multiple files, the LightCurve class can’t work with this.

Astropy Unit Incompatible

While the rest of the SunPy library has moved to support Astropy Units, which enable easy comparison of data with different units, the LightCurve class doesn’t have any support.

Lack of Functionality

At the moment, despite the underlaying Pandas DataFrames supporting robust data selection and manipulation functionality (like filtering times, resampling and such), the LightCurve lacks all of this.


The Debate - Implementing on AstroPy Tables

The LightCurve class is based on the Python Pandas DataFrame class, this class is designed to store, manipulate and visualise time series date. But this is both another dependency for SunPy (Pandas aren’t used elsewhere) and not being specifically science based the possible future support for anything similar to AstroPy Units is very unlikely.
The suggestion was made to use AstroPy Tables for storing the data, this seemed ideal, as they support Units natively, can support more advanced AstroPy Time datastructures and WCS coordinates.
But, upon investigation there was a major floor with the AstroPy tables, essentially to support AstroPy Time you need a Table with Mixing Column, and seeing as Time objects are immutable (simply an implementation decision) then this stops you from being able to re-order or append to a column containing Times. This basically meant that re-arranging or adding to a table means completely recreating it, at best memory and processor intensive and at worst simply impractical.
There is an outstanding request for a TimeSeries datatype based on AstroPy Tables:
    github.com/Cadair/astropy-APEs/blob/master/APE9.rst
But for now the decision was made to keep the implantation using Pandas.

This gets you up to speed on the problem and the early work I did.