Find the right time series labeling tool for your machine learning project
Over the past couple of years, I have delved more heavily into supervised and unsupervised machine learning projects involving time series. In particular, I’ve done a lot of work in changepoint detection (see here), as well as individual data point classification. During this time, I’ve found the process of labeling time series data sets for supervised or unsupervised ML grueling and largely ignored by the data science community. Today I present some of the tools I’ve explored for labeling time series data, which make the process less painful and more automated.
First off—why would I want to label time series data? What does that even mean?
Time series labeling is largely focused on classification tasks. For example, you may want to classify a certain data point or data segment in a time series as representing a particular type of behavior.
As an example, I’ll use an open-source project I worked on, where I needed to label flat-line behavior for a particular phenomenon in solar power time series data. The flat-line behavior could be variable, with most cases occurring at the highest value(s) of the day. I needed to manually review data points on a daily basis, label them, and use this labeled data to build an algorithm that automatically classified data points. See below for a labeled Plotly graphic which shows a manually labeled solar power time series, with flat-line behavior in yellow (a label of 1), and all other behavior in blue (a label of 0).
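To make the 0/1 scheme concrete, here is a minimal Python sketch of per-point binary labels and how they collapse into contiguous labeled segments. This is not the code from the project: the helper name label_runs is hypothetical, and the readings and labels are made up for illustration.

```python
from itertools import groupby


def label_runs(labels):
    """Collapse per-point labels into (label, start_index, end_index) runs.

    Indices are positions in the series; end_index is inclusive.
    """
    runs = []
    position = 0
    for label, group in groupby(labels):
        length = sum(1 for _ in group)
        runs.append((label, position, position + length - 1))
        position += length
    return runs


# Hypothetical power readings for part of one day (illustrative values only).
power = [0.0, 1.2, 3.4, 5.0, 5.0, 5.0, 3.1, 0.9]
# One label per data point: 1 = flat-line behavior, 0 = everything else.
labels = [0, 0, 0, 1, 1, 1, 0, 0]

runs = label_runs(labels)
# Keep only the index ranges that were labeled as flat-line.
flat_runs = [(start, end) for label, start, end in runs if label == 1]
```

Point-wise labels like these are what I wanted to produce by hand; the run form is handy when a tool or plot works with labeled regions instead of individual points.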
For labeling the example above, I needed a tool that allowed me to:
- Use at least a binary labeling scheme (0 or 1). Although the task above is strictly a binary labeling task, I also wanted a tool that could handle a multiclass labeling scheme
- Label data points individually when performing data labeling tasks
- Get started without much difficulty, ideally through an intuitive UI
Although the above needs are fairly basic, I spent a significant amount of time trying to find a tool that met all three requirements.
Reviewing the Resources
During my Google search on the topic, two tools kept popping up:
- Geocene TRAINSET
- Label Studio
Below I give a brief overview of each of these tools.
Geocene TRAINSET is a lightweight web application for time series labeling, and was built specifically to aid in the development of time series classification training sets (https://trainset.geocene.com/).
The Geocene TRAINSET tool has several great features, including:
- It’s completely free to use
- It doesn’t require any licensing or downloads
- It allows for binary and multiclass labeling
The trickiest part of using the tool is the time series formatting. You have to have EXACT formatting when loading data into the tool or it will fail. This includes the order of the columns, which must have the following order: series, timestamp, value and label. Some example time series data in the correct format is shown below:
Pay particular attention to the timestamp formatting. It can be expressed in the format above, as so:
Once you successfully load data into the tool, you can highlight data to label it, as shown below:
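As a rough sketch of how a file in this column order could be produced in Python, the snippet below writes rows as series, timestamp, value, label. The helper name to_trainset_csv, the series name, and the sample values are all hypothetical. The timestamp style used here (ISO 8601 with an explicit UTC offset) is an assumption on my part, so check it against the TRAINSET help page before uploading.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

# Column order described above; the tool rejects files that deviate from it.
TRAINSET_COLUMNS = ["series", "timestamp", "value", "label"]


def to_trainset_csv(series_name, start, step, values, labels):
    """Return CSV text with columns ordered series, timestamp, value, label."""
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(TRAINSET_COLUMNS)
    for i, (value, label) in enumerate(zip(values, labels)):
        # Assumed timestamp style: ISO 8601 with an explicit UTC offset.
        timestamp = (start + i * step).isoformat(timespec="seconds")
        writer.writerow([series_name, timestamp, value, label])
    return buffer.getvalue()


# Made-up readings at 15-minute intervals, starting at noon UTC.
start = datetime(2021, 6, 1, 12, 0, tzinfo=timezone.utc)
csv_text = to_trainset_csv(
    "solar_power", start, timedelta(minutes=15), [4.8, 5.0, 5.0], [0, 1, 1]
)
```

Writing the header from a fixed list keeps the column order from drifting if the upstream data frame or dictionary is reordered.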
You can also select the ‘+’ button next to the ‘Label’ field to add new labeling classes. This is especially useful if you need to do multiclass labeling:
Overall, this tool’s UI is incredibly intuitive and allows the user to parse through a significant amount of time series data within a short period of time (with minimal interruptions).
Label Studio is also an open-source data labeling tool for labeling a wide variety of different types of data, including time series data. It does have an enterprise version, which I have never tried and won’t be outlining here. You can download and initialize the Label Studio package in an Anaconda virtual environment via the following commands:
conda create --name label-studio
conda activate label-studio
pip install label-studio
The package then prompts you to sign up for the community edition:
After signing up, create a new project in the application called ‘Time Series sample’, and save it:
Next, you can import some pre-annotated time series data, available directly from the Label Studio package. The data set in question is available via this link (https://app.heartex.ai/samples/time-series.csv?time=None&values=first_column):
I struggled to find additional data examples, so I’m still not totally sure about the CSV formatting for loads. After 20 minutes of Google searching, I still could not find a dedicated section in the documentation for example data sets, which is why I am using this example.
After downloading the data, you have to directly configure the label interface for the time series data. You can adapt a pre-existing label interface format to create a binary labeling scheme, as so:
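<View>
<Header value="Time Series classification"/>
<TimeSeriesLabels name="label" toName="ts"><Label value="0"/><Label value="1"/></TimeSeriesLabels>
<TimeSeries name="ts" value="$csv" valueType="url"><Channel column="first_column"/></TimeSeries>
</View>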
Save the label interface, and now it’s finally time to label the data! Click on the time series on the main page, and the labeling GUI will pop up. Select a classification, and drag your mouse across a section of the time series to label it accordingly. The example labeled data set in the labeling GUI is shown below:
Although the label interface requires some effort to set up correctly, adding new labels is fairly easy, by just appending to the label list. For example, a label list that looks like this:
Creates 3 multiclass labels: 0, 1, and 2. The drag-and-drop interface is also fairly intuitive, even though the process for data uploading is quite strenuous.
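Since the label list is just a sequence of Label tags, the label interface can also be generated instead of edited by hand. The sketch below is one way to do that with Python's standard library, not an official Label Studio utility. The helper name build_label_config is hypothetical, and the Channel tag with column first_column is an assumption based on the values=first_column parameter in the sample data URL.

```python
import xml.etree.ElementTree as ET


def build_label_config(label_values, channel_column="first_column"):
    """Build a time series label interface with one Label tag per value."""
    view = ET.Element("View")
    ET.SubElement(view, "Header", value="Time Series classification")
    labels = ET.SubElement(view, "TimeSeriesLabels", name="label", toName="ts")
    for value in label_values:
        # Appending to the label list is just one more Label tag.
        ET.SubElement(labels, "Label", value=str(value))
    series = ET.SubElement(
        view, "TimeSeries", name="ts", value="$csv", valueType="url"
    )
    # Assumed channel column, taken from the sample data URL.
    ET.SubElement(series, "Channel", column=channel_column)
    if hasattr(ET, "indent"):  # pretty-printing is available in Python 3.9+
        ET.indent(view)
    return ET.tostring(view, encoding="unicode")


# Three classes, matching the multiclass example above.
config = build_label_config([0, 1, 2])
```

The resulting string can be pasted into the label interface editor; going from a binary to a multiclass scheme only changes the list passed in.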
Overall Winner: Geocene
For time series labeling tasks, Geocene hands down beats Label Studio, largely based on its ease of use. I have no doubt Label Studio allows for more dynamic labeling structures and can handle many different labeling tasks (not just time series data), but tool complexity and documentation holes make it difficult to use. The number of steps required to get to the actual labeling task is much lower in Geocene than in Label Studio, which is great for casual users like me. Essentially this boils down to a user experience comparison, with Label Studio being a ‘breadth’ tool (it covers many different types of data labeling tasks, but has a much higher barrier to entry) and Geocene TRAINSET being a ‘depth’ tool (easy to use, but only optimized for time series labeling tasks).