GRASP HEURISTIC FOR TIME SERIES COMPRESSION WITH PIECEWISE AGGREGATE APPROXIMATION

The Piecewise Aggregate Approximation (PAA) is widely used in time series data mining because it discretizes time series, reduces their length, and serves as a subroutine for pattern discovery, indexing, and classification algorithms. However, it requires setting one parameter: the number of segments to consider during the discretization. The optimal value of this parameter is highly data dependent, in particular for long time series. This paper presents a heuristic for time series compression with PAA which minimizes the loss of information. The heuristic is built upon the well-known metaheuristic GRASP and strengthened with a specific global search component. An extensive experimental evaluation on several time series datasets demonstrates its efficiency and effectiveness in terms of compression ratio, compression interpretability and classification.
Mathematics Subject Classification. 90C59. Received April 30, 2017. Accepted October 12, 2018.


Introduction
Time series databases are often large, and several transformations have been introduced to represent them more compactly. One of these transformations is Piecewise Aggregate Approximation (PAA) [13], which divides a time series into several segments of fixed length and replaces the data points of each segment with their average. Owing to its simplicity and low computational cost, PAA has been widely used as a basic primitive by other temporal data mining algorithms such as [16,17,25], in order
• to construct symbolic representations of time series [3,26];
• to construct an index for time series [12,14,30]. Indeed, PAA allows queries that are shorter than the length for which the index was built. This very desirable feature is impossible with the Discrete Fourier Transform, Singular Value Decomposition and the Discrete Wavelet Transform;
• to classify time series [5].
An exhaustive comparison of time series algorithms [1] shows that Dynamic Time Warping (DTW) is among the efficient techniques to be used. However, DTW has two major drawbacks: the comparison of two time series under DTW is time-consuming [20], and DTW sometimes produces pathological alignments [15]. A pathological alignment occurs when, during the comparison of two time series X and Y, one data point of X is compared to a large subsequence of data points of Y. A pathological alignment causes a wrong comparison.
Three categories of methods are used to avoid pathological alignments with DTW:
• The first one adds constraints to DTW [4,11,21-23,28]. The main idea here is to limit the length of the subsequence of a time series that can be compared to a single data point of another time series.
• The second one suggests skipping data points that produce pathological alignments during the comparison of two time series [10,18,19].
• The third one proposes to replace the data points of time series with a high-level abstraction that captures the local behavior of those time series. A high-level abstraction can be a histogram of values that captures the distribution of time series data points in space [29] or a feature that captures the local properties of time series, such as the trend with Derivative DTW (DDTW) [15].
Another simple yet interesting way to capture local properties of time series is to consider the mean of segments of the time series, as PAA does. Indeed, using the mean reduces the harmful effects of singularities contained in the data and thus helps avoid pathological alignments. However, one major challenge with PAA is the choice of the number of segments, especially with long time series. When the number of segments is very small, information is lost and accuracy is reduced. Conversely, considering all the points of the time series does not yield maximum accuracy either, because of the noise or singularities [15] present in the data.

The problem of choosing a suitable segment number for PAA
If the number of segments considered with PAA is too small, the resulting representation is compact, but it contains less information. On the other hand, if the number of segments is too large, the obtained representation is less compact and more prone to the noise contained in the original time series (Fig. 1). Our idea is that a number of segments for PAA is good if it yields a compact representation of the time series and also preserves the quality of the alignment of time series. For a classification task, one of the best algorithms for evaluating the quality of time series alignment is one nearest neighbor (1NN). Indeed, its classification error directly depends on time series alignment, since 1NN has no other parameters [27].
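The trade-off between compactness and information can be made concrete with a minimal PAA sketch in Python. The function name `paa` and the way segment boundaries are rounded when the length is not a multiple of the number of segments are assumptions made for illustration; the paper itself only considers segments of fixed length.

```python
def paa(series, n_segments):
    """Return the PAA of `series`: the mean of each of `n_segments` consecutive segments."""
    n = len(series)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(series)")
    means = []
    for s in range(n_segments):
        # Integer boundaries keep the segments contiguous and of near-equal length
        # (an assumption; with n divisible by n_segments every segment has length n / n_segments).
        start = (s * n) // n_segments
        end = ((s + 1) * n) // n_segments
        segment = series[start:end]
        means.append(sum(segment) / len(segment))
    return means


series = [1.0, 2.0, 3.0, 4.0, 10.0, 12.0, 5.0, 5.0]
coarse = paa(series, 2)  # very compact, but the peak is smoothed away
fine = paa(series, 4)    # less compact, closer to the original shape
```

With two segments the peak around 10 to 12 is averaged with its neighbours, whereas four segments still isolate it; choosing the number of segments amounts to choosing between these behaviours.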

Summary of contributions
In this paper,
• We define the problem of preprocessing time series with PAA for a better classification with DTW.
• We propose a parameter-free heuristic for aligning piecewise aggregate time series with DTW, which approximates the optimal value of the number of segments to be considered with PAA.
• We make our source code and all our results available to allow the reproducibility of our experiments.
The rest of the paper is organized as follows: Section 2 recalls the definitions and background; Section 3 explains our approach; Section 4 presents experimental results and comparisons to other methods; Section 5 offers conclusions and avenues for future work.

Background and related works
Let us recall some definitions.
Definition 2.1. A time series $X = x_1, \dots, x_n$ is a sequence of numerical values representing the evolution of a specific quantity over time; $x_n$ is the most recent value.
Definition 2.2. A segment $X_i$ of length $l$ of the time series $X$ of length $n$ ($l < n$) is a sequence constituted by $l$ variables of $X$ starting at position $i$ and ending at position $i + l - 1$. We have:
$$X_i = x_i, x_{i+1}, \dots, x_{i+l-1}.$$
Definition 2.3. The arithmetic average of the data points of a segment $X_i$ of length $l$ is noted $\bar{X}_i$ and is defined by:
$$\bar{X}_i = \frac{1}{l} \sum_{j=0}^{l-1} x_{i+j}. \qquad (2.1)$$
Definition 2.4. Let $T$ be the set of time series. The Piecewise Aggregate Approximation (PAA) is defined as follows:
$$PAA : T \times \mathbb{N}^* \to T, \qquad (X, N) \mapsto PAA(X, N) = \bar{X}_1, \bar{X}_{l+1}, \dots, \bar{X}_{(N-1)l+1}, \quad \text{with } l = n/N.$$
DTW [22] is a time series alignment algorithm that performs a non-linear alignment while minimizing the distance between two time series. To align two time series $X = x_1, x_2, \dots, x_n$ and $Y = y_1, y_2, \dots, y_m$, the algorithm constructs an $n \times m$ matrix where the cell $(i, j)$ corresponds to the squared distance $(x_i - y_j)^2$ between $x_i$ and $y_j$. Then, to find the best alignment between $X$ and $Y$, DTW constructs the path that minimizes the sum of squared distances. This path, noted $W = w_1, w_2, \dots, w_k, \dots, w_K$, must respect the following constraints:
• Boundary constraint: $w_1 = (1, 1)$ and $w_K = (n, m)$;
• Monotonicity constraint: given $w_k = (i, j)$ and $w_{k+1} = (i', j')$, then $i \le i'$ and $j \le j'$;
• Continuity constraint: given $w_k = (i, j)$ and $w_{k+1} = (i', j')$, then $i' \le i + 1$ and $j' \le j + 1$.
The warping path is computed by a dynamic programming algorithm that solves the following recurrence:
$$\gamma(i, j) = d(x_i, y_j) + \min\{\gamma(i-1, j-1),\ \gamma(i-1, j),\ \gamma(i, j-1)\},$$
where $d(x_i, y_j)$ is the squared distance contained in the cell $(i, j)$ and $\gamma(i, j)$ is the cumulative distance at position $(i, j)$, computed as the sum of the squared distance at position $(i, j)$ and the minimal cumulative distance of its three adjacent cells.
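The dynamic programming recurrence for the cumulative distance described above can be sketched in a few lines of Python. This is an unconstrained, textbook-style implementation; the function name `dtw` and the padded matrix with an extra row and column of infinities are illustrative choices rather than details taken from the paper.

```python
def dtw(x, y):
    """Cumulative squared-distance DTW between two numeric sequences x and y."""
    n, m = len(x), len(y)
    inf = float("inf")
    # gamma[i][j] is the cumulative distance of the best path aligning x[:i] with y[:j].
    gamma = [[inf] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0  # boundary condition: both paths start together
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            # Local cost plus the cheapest of the three adjacent cells.
            gamma[i][j] = d + min(gamma[i - 1][j - 1],
                                  gamma[i - 1][j],
                                  gamma[i][j - 1])
    return gamma[n][m]
```

Because the path may repeat a point of one series against several points of the other, `dtw([1, 2, 3], [1, 2, 2, 3])` costs nothing, which is the non-linear alignment that a point-by-point Euclidean comparison cannot express.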
The Piecewise Dynamic Time Warping algorithm (PDTW) [14] is the DTW algorithm applied to piecewise aggregate time series [13]. Let $N \in \mathbb{N}^*$, and let X and Y be two time series:
$$PDTW(X, Y, N) = DTW(PAA(X, N), PAA(Y, N)).$$
The number of segments N that one considers greatly influences the quality of the alignment of the time series. However, PDTW does not give any information on how to choose it. To make this choice, Chu et al. [6] propose the Iterative Deepening Dynamic Time Warping algorithm (IDDTW).

Iterative deepening dynamic time warping
To determine the number of segments, IDDTW only considers values that are powers of 2 and, for each value, computes an error distribution by comparing PDTW with the standard DTW at each level of compression. It takes as inputs the query Q, the dataset D, the user's confidence (or tolerance for false dismissals) user_conf, and the set of standard deviations StdDev obtained from the error distribution.
Example: Let C and Q be two time series of the dataset D, and let best_so_far be the DTW distance between two time series of the dataset. Suppose the distance D_pdtw(Q, C) is 40 and best_so_far is 30. The difference between the estimated distance and best_so_far is 10. Using the error distribution centred on the approximation (40), we can determine the probability that the candidate could be better by examining the area beyond the location of best_so_far (shown in solid black in Fig. 2). If this probability is less than the user's specified error acceptance, the candidate is disqualified; otherwise, a finer approximation is used and the test is re-applied at the next depth. This process continues until the full DTW is performed.
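The pruning test of the example can be illustrated numerically. The sketch below assumes that the error distribution is Gaussian, which the text does not state explicitly, and uses a hypothetical standard deviation of 10; the function names `prob_better_than_best` and `can_prune` are likewise illustrative.

```python
import math


def prob_better_than_best(approx_dist, best_so_far, std_dev):
    """P(true distance < best_so_far) under a normal error model centred on approx_dist."""
    z = (best_so_far - approx_dist) / std_dev
    # Standard normal CDF expressed with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def can_prune(approx_dist, best_so_far, std_dev, tolerance):
    """Disqualify the candidate when it is unlikely to beat best_so_far."""
    return prob_better_than_best(approx_dist, best_so_far, std_dev) < tolerance


# Values from the example: approximation 40, best_so_far 30; std_dev = 10 is hypothetical.
p = prob_better_than_best(40.0, 30.0, 10.0)
```

Under these assumptions the candidate lies one standard deviation above best_so_far, so it is pruned with a tolerance of 0.2 but refined to the next depth with a stricter tolerance of 0.05.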
More precisely, IDDTW proceeds as follows:
• the algorithm starts by applying the classic DTW to the first K candidates from the dataset. The results of the best matches to the query are contained in R, with |R| = K. The best_so_far is determined from argmax R;
• both the query Q and each subsequent candidate C are approximated using PAA representations with N segments to determine the corresponding PDTW;
• a test is performed to determine whether the candidate C can be pruned or not. If the test indicates a sufficient probability that C could be a better match than the current best_so_far, a higher resolution of the approximation is required: each segment of the candidate is split into two segments to obtain a new candidate;
• the process of approximating Q and C to determine the PDTW is reapplied and the test is repeated for all approximation levels until the candidate fails the test or its true DTW distance is determined.
Figure 2. IDDTW operating principle. Depth represents approximation levels, A represents approximate distance and B is best so far [6].
In this way, IDDTW finds the number of segments that best approximates DTW and speeds up its computation. However, IDDTW has three main limitations:
• it only considers numbers of segments for PDTW that are powers of 2;
• it requires a user-specified tolerance for false dismissals that influences the quality of the approximation, but the algorithm does not give any indication on how to choose the tolerance;
• it considers DTW as a reference while looking for the number of segments that best aligns the time series. However, because of pathological alignments, DTW sometimes fails to align time series properly [15].
Our goal is to find the number of segments that best aligns the time series and also speeds up the computation of DTW. We propose a heuristic named parameter Free piecewise DTW (FDTW), based on the Greedy Randomized Adaptive Search Procedure (GRASP), that addresses all the limitations of IDDTW: it considers all the possible values for the number of segments, it is parameter-free, and it finds a number of segments for PDTW based on the quality of the time series alignment, namely the error rate on a classification task. The next section introduces FDTW.

Evaluation procedures for the compression quality
Before explaining how to evaluate the quality of time series compression, we first describe the time series datasets that we considered. They are made up of time series associated with labels that identify the shape of the latter. For instance, in the ECG dataset, each time series traces the electrical activity recorded during one heartbeat. The two classes are a normal heartbeat and a Myocardial Infarction.
Time series classification is a classic problem which consists in guessing the label of an unlabeled time series based on its shape. The quality of a time series classification model is evaluated from its classification error (ε) or its accuracy (a = 1 − ε). For a classification task, one of the best algorithms for evaluating the quality of time series alignment is one nearest neighbor (1NN). Indeed, its classification error directly depends on time series alignment, since 1NN has no other parameters [27].
In this work, a compact representation of time series is considered to be good if it reduces the length of the original time series, but also if the classification error obtained by classifying the compact time series is small. The classification error is small when the time series keep their characteristic shape despite compression.

Problem definition
Let $D = \{d_i\}$ be a set of datasets composed of time series. We denote by $|d_i|$ the number of time series of the dataset $d_i$.
Let $X \in d_i$ be a time series of the dataset $d_i$; we denote by $|X| = n$ the length of the time series $X$. For simplicity of notation, we suppose that all the time series of $d_i$ have the same length.
$1NN_{DTW}(d_i)$ is the classification error of one nearest neighbour with Dynamic Time Warping on the dataset $d_i$. $1NN_{PDTW}(d_i, N)$ is the classification error of 1NN with PDTW using $N$ segments on $d_i$.
Our goal is to find the number of segments that allows PDTW to best align time series. PDTW gives a good alignment when its classification error with 1NN is low [20]. Our problem is then to find the number of segments $N$ that minimizes $1NN_{PDTW}(d_i, N)$.
Formally, given a dataset $d_i$ whose time series have a length $n$, we look for the number of segments $N^* \in \{1, \dots, n\}$ such that
$$N^* = \arg\min_{N \in \{1, \dots, n\}} 1NN_{PDTW}(d_i, N).$$

Brute-force search
The simplest way to find the number of segments that minimizes the classification error is to test all the possible values. Obviously, this method is time consuming, as it requires testing n values to find the best one. Since evaluating one value with 1NN requires $O(|d_i|^2)$ PDTW comparisons, each costing at most $O(n^2)$, the time complexity is $O(|d_i|^2 \, n^3)$. To reduce the search time, FDTW looks for the number of segments with the minimal classification error without testing all the possible values.
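A brute-force search of this kind can be sketched as follows. The helper names (`paa`, `dtw`, `one_nn_pdtw_error`, `brute_force`) and the toy dataset are hypothetical, and the error is estimated by leave-one-out rather than the threefold cross validation used in the experiments, which is a simplifying assumption.

```python
def paa(series, n_segments):
    """Mean of each of `n_segments` contiguous, near-equal segments."""
    n = len(series)
    bounds = [(s * n) // n_segments for s in range(n_segments + 1)]
    return [sum(series[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]


def dtw(x, y):
    """Unconstrained DTW with squared point-wise distances."""
    inf = float("inf")
    gamma = [[inf] * (len(y) + 1) for _ in range(len(x) + 1)]
    gamma[0][0] = 0.0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            gamma[i][j] = d + min(gamma[i - 1][j - 1], gamma[i - 1][j], gamma[i][j - 1])
    return gamma[-1][-1]


def one_nn_pdtw_error(dataset, n_segments):
    """Leave-one-out 1NN error with PDTW; `dataset` is a list of (series, label) pairs."""
    reduced = [(paa(s, n_segments), label) for s, label in dataset]
    errors = 0
    for i, (query, label) in enumerate(reduced):
        others = (c for j, c in enumerate(reduced) if j != i)
        nearest = min(others, key=lambda c: dtw(query, c[0]))
        errors += nearest[1] != label
    return errors / len(reduced)


def brute_force(dataset):
    """Test every N in 1..n and return (best N, its error)."""
    n = len(dataset[0][0])
    errors = {N: one_nn_pdtw_error(dataset, N) for N in range(1, n + 1)}
    # Ties are resolved toward the largest N, as the paper does for equal training errors.
    best_n = min(errors, key=lambda N: (errors[N], -N))
    return best_n, errors[best_n]


toy = [([0, 0, 1, 1, 0, 0], "a"), ([0, 1, 1, 0, 0, 0], "a"),
       ([1, 1, 0, 0, 1, 1], "b"), ([1, 0, 0, 1, 1, 1], "b")]
best_n, best_err = brute_force(toy)
```

Every value of N triggers a full pass over all pairs of series, which is exactly the cost that the heuristic tries to avoid.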

Greedy randomized adaptive search procedures
The Greedy Randomized Adaptive Search Procedures (GRASP) is a multi-start, or iterative metaheuristic proposed by Feo and Resende [8], in which each iteration consists of two phases: firstly a new solution is constructed by a greedy randomized procedure and then is improved using a local search procedure.
The greediness criterion establishes that elements with the best quality are added to a restricted candidate list and chosen at random when building up the solution. The candidates obtained by greedy algorithms are not necessarily optimal, so they are used as initial solutions to be explored by local search. The heuristic we propose is built upon GRASP and strengthened with a specific global search component.
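For reference, a generic GRASP loop that minimises a cost over a finite set of integers might look like the sketch below. It is not the paper's Algorithm 1: the name `grasp_minimize`, the parameter `rcl_size`, the sampling-based construction phase and the neighbourhood made of the two adjacent values are all illustrative assumptions.

```python
import random


def grasp_minimize(cost, candidates, iterations=20, rcl_size=3, seed=0):
    """Multi-start search: greedy randomized construction followed by local search."""
    rng = random.Random(seed)
    candidates = sorted(candidates)
    best = None
    for _ in range(iterations):
        # Construction: rank a random sample greedily, keep the best few in the
        # restricted candidate list (RCL), then pick one of them at random.
        sample = rng.sample(candidates, min(len(candidates), 2 * rcl_size))
        rcl = sorted(sample, key=cost)[:rcl_size]
        current = rng.choice(rcl)
        # Local search: move to an adjacent value while it lowers the cost.
        improved = True
        while improved:
            improved = False
            idx = candidates.index(current)
            for neighbour in candidates[max(0, idx - 1): idx + 2]:
                if cost(neighbour) < cost(current):
                    current, improved = neighbour, True
        if best is None or cost(current) < cost(best):
            best = current
    return best
```

On a unimodal cost such as `lambda v: (v - 7) ** 2` over `range(1, 21)` every start converges to 7; the random restarts matter when the cost has several local minima, as classification error curves typically do.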

Parameter free heuristic
The idea of our heuristic is the following:
1. We choose M candidate values for the number of segments, evenly spaced over the possible values {1, ..., n}. In our example, the candidates include 3, 6 and 9.
2. We evaluate the classification error with 1NN_PDTW for each chosen candidate, and we select the candidate that has the minimal classification error: it is the best candidate. In our example, we may suppose that we get the minimal value with the candidate 6: it is thus the best candidate at this step.
3. We then look between the predecessor (i.e., 3 here) and the successor (i.e., 9 here) of the best candidate for a number of segments with a lower classification error: this number of segments corresponds to a local minimum. In our example, we test the values 4, 5, 7 and 8 to see if there is a local minimum.
4. We restart at step 1, choosing different candidates at each iteration, to ensure that we return a good local minimum. We fix the number of iterations to k ≤ log(n). At each iteration, the first candidate is n − (iteration number − 1).
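The steps above can be sketched in Python as follows. This is a reading of the textual description, not the authors' Algorithm 1: the function name `fdtw_search`, the use of ⌊√(2n)⌋ evenly spaced candidates, the caching of evaluations and the tie-breaking toward larger values are assumptions, and `error` stands for any callable returning the 1NN PDTW classification error for a given number of segments.

```python
import math


def fdtw_search(error, n, k=3):
    """Approximate the N in 1..n that minimises error(N) with k restarts."""
    m = max(1, math.isqrt(2 * n))  # number of candidates per iteration (assumed M = floor(sqrt(2n)))
    step = max(1, n // m)          # spacing between consecutive candidates
    cache = {}

    def err(N):
        # Each number of segments is evaluated at most once.
        if N not in cache:
            cache[N] = error(N)
        return cache[N]

    def rank(N):
        # Lower error first; ties resolved toward the larger N.
        return (err(N), -N)

    best = None
    for it in range(k):
        first = max(1, n - it)                   # first candidate shifts by one at each restart
        candidates = range(first, 0, -step)      # step 1: evenly spaced candidates
        center = min(candidates, key=rank)       # step 2: best candidate
        lo = max(1, center - step + 1)           # step 3: values strictly between the
        hi = min(n, center + step - 1)           #         predecessor and the successor
        local_best = min(range(lo, hi + 1), key=rank)
        if best is None or rank(local_best) < rank(best):
            best = local_best                    # step 4: keep the best local minimum
    return best, err(best)
```

With n = 12 the first iteration evaluates the candidates 12, 9, 6 and 3; if 6 is the best, the values 4, 5, 7 and 8 are then tested, which matches the example in the text.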
In short, in the worst case, we test the first M candidates to find the best one. Then, we test 2n/M other candidates to find the local minimum. We finally perform $nb(M) = M + 2n/M$ tests. The number of tests to be performed is a function of the number of candidates. Hence, how many candidates should we consider to reduce the number of tests? The first derivative of the function nb vanishes when $M = \sqrt{2n}$ and its second derivative is positive, so the minimal number of tests is obtained when the number of candidates is $M = \sqrt{2n}$. At each iteration, the heuristic tests $nb(\sqrt{2n}) = \sqrt{8n}$ numbers of segments. As we have k iterations, the number of candidates tested is $|C| = k\sqrt{8n}$. The details of the heuristic are presented in Algorithm 1.
Time complexity: We use the training set to find the number of segments that should be considered with PDTW. For that purpose, we apply 1NN on the training set, which costs $O(|d|^2 \, n^2 \sqrt{n})$, where $|d|^2$ comes from the 1NN algorithm and $n^2 \sqrt{n}$ comes from PDTW.
Nevertheless, a heuristic does not always give the optimal value. To ensure that it gives a result not far from the optimal value, one approach is to guarantee that the result of the heuristic always lies in an interval with respect to the optimal value [9].
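Returning to the number of tests per iteration, the minimisation of nb(M) can be written out step by step. The derivation below only uses the expression nb(M) = M + 2n/M stated in the text and treats M as a positive real number.

```latex
\begin{align*}
nb(M)  &= M + \frac{2n}{M}, \\
nb'(M) &= 1 - \frac{2n}{M^{2}} = 0 \iff M = \sqrt{2n}, \\
nb''(M) &= \frac{4n}{M^{3}} > 0 \quad \text{for } M > 0, \\
nb\left(\sqrt{2n}\right) &= \sqrt{2n} + \frac{2n}{\sqrt{2n}} = 2\sqrt{2n} = \sqrt{8n}.
\end{align*}
```

As an order of magnitude, for time series of length n = 637 this gives about 71 tests per iteration, hence about 214 tests for k = 3 iterations, against 637 tests for an exhaustive search.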
In our case, we are looking for the number of segments that allows a good alignment of time series. The alignment is good when the classification error with 1NN is minimal or when the accuracy is maximal.
Let $d_i$ be a dataset: $acc_{max}(d_i) = 1 - \min_{N \in \{1, \dots, n\}} 1NN_{PDTW}(d_i, N)$. To ensure the quality of our heuristic FDTW, we hypothesized that 1NN-DTW is better than the Zero Rule classifier. The Zero Rule classifier is a simple classifier that predicts the majority class of test data (if nominal) or the average value (if numeric). Zero Rule is often used as a baseline classifier [7]. The minimal value of the accuracy of Zero Rule is 1/c, where c is the number of classes of the dataset.
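The Zero Rule baseline for nominal labels is simple enough to state in code. In this minimal sketch the function name `zero_rule_accuracy` is hypothetical, and the accuracy is measured on the same list of labels from which the majority class is taken.

```python
from collections import Counter


def zero_rule_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label in `labels`."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)


balanced = zero_rule_accuracy(["a", "b", "c"])         # three equally frequent classes
unbalanced = zero_rule_accuracy(["a", "a", "b", "c"])  # class "a" dominates
```

The majority class always covers at least a fraction 1/c of the series, so the bound 1/c is reached exactly when the c classes are perfectly balanced.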

Experiment and results
Throughout the experiments described in this paper, FDTW performs three iterations (k = 3) when searching for the appropriate number of segments for a dataset. To evaluate the ability of the FDTW heuristic to propose a good number of segments for PAA, it has been compared to the IDDTW algorithm in terms of:
• heuristic execution speed;
• time series compression ratio;
• classification error associated with the number of segments found by the heuristic.

Case studies
PAA is widely used in temporal data mining, often as a primitive by other algorithms such as those that construct a symbolic representation of time series, index time series, or classify time series. In this section, we present some algorithms for which the pre-processing performed by FDTW improves the final results.

Datasets
The experiments have been performed first on 45 datasets and then on 84 datasets of the UCR time series data mining archive [5], which provides a large collection of datasets covering various categories of domains. Each dataset is divided into a training set and a testing set. The 84 datasets possess between 2 and 60 classes, the length of time series varies from 24 to 2709, the training sets contain between 16 and 8926 time series and the testing sets contain between 20 and 8236 time series. All datasets are publicly available on the UCR time series classification page.

Compression
Compression ratio: An immediate way to evaluate the quality of the segmentation is to compare the compression ratios. A segment number N_1 is better than a segment number N_2 if it yields a more compact representation with PAA. The compression ratio is given by:
$$r = 1 - \frac{N}{n},$$
where n is the length of the time series and N is the number of segments considered with PAA. The closer r is to 1, the better the compression. The numbers of segments used here are shown in Table 1. For the considered datasets, the mean compression ratio of IDDTW (r = 0.654) is slightly higher than that of FDTW (r = 0.605). However, this difference is not significant: the Wilcoxon test gives a p-value greater than 0.1 (p > 0.1). Therefore, we cannot reject the hypothesis that the compression ratios of IDDTW and FDTW are equal.
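As a small worked example, the ratio can be computed as below, assuming it is defined as r = 1 − N/n, a form consistent with the statement that values closer to 1 indicate better compression. The function name `compression_ratio` and the numbers used are purely illustrative and are not taken from Table 1.

```python
def compression_ratio(n, n_segments):
    """Fraction of the original length saved by PAA (assumed form r = 1 - N / n)."""
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and n")
    return 1 - n_segments / n


r = compression_ratio(100, 25)  # a series of 100 points reduced to 25 segment means
```

A series kept at full resolution (N = n) has r = 0, and the ratio approaches 1 as fewer segments are retained.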
Application: PAA used with a suitable segment number allows compression of the time series of the Coffee dataset without loss of information. Although they are more compact, the obtained time series capture the main variations of the original time series (Fig. 3).

Classification
Piecewise Aggregate Approximation is used by ShapeDTW [30] and DTW F [12] to classify time series. However, to evaluate the actual impact of the segment number considered on the classification, we tested FDTW to choose the number of segments to use with 1NN and PDTW.
PDTW was designed to speed up the calculation of DTW without degrading the accuracy. Here, we observe that when the number of segments is well chosen, this may even improve the classification results.
Quality of the number of segments found: A segment number N_1 is better than a segment number N_2 if the classification error associated with N_1 is smaller than that associated with N_2. So, to evaluate the quality of our heuristic FDTW, we compared its classification errors with those of IDDTW. The classification error was calculated using threefold cross validation applied on the training set. IDDTW tested all the values of N that were equal to a power of two and kept the one that had the minimum classification error (Tab. 1).

Application:
As stated in Lemma 3.3, the classification error of FDTW during the learning phase (training error) is less than or equal to that of DTW for all the considered datasets. We used the Wilcoxon signed rank test with continuity correction to test the significance of FDTW against IDDTW. The Wilcoxon signed rank test gives a p-value p < 0.01, which demonstrates that FDTW achieves a significant reduction of the classification error of IDDTW. This also demonstrates that FDTW finds segment numbers for PAA that are of better quality than those found by IDDTW during the learning phase.
Comparison with IDDTW: To evaluate the quality of FDTW, we compared its classification errors with those of IDDTW and with the minimal one. The minimal classification error was found by applying Brute-force search (BF) on both the training set and the testing set. FDTW and IDDTW used the training set to find the segment number N with minimal training error using threefold cross validation, and then used this number of segments on the testing set to compute the classification error. The value of the segment number N found on the training set may in some cases not be appropriate for the testing set. We speak of a generalization error, which is due to the representativeness of the training set (Tab. 2). If two numbers of segments N_1 and N_2 are associated with the same training error, we retain the largest. IDDTW tested all the values of N that were equal to a power of two during the learning phase and kept the one that had the minimum classification error.
The experiments showed that FDTW performs better than IDDTW. FDTW resulted in a lower generalization error than IDDTW on 22 datasets and in the same generalization error as IDDTW on eight datasets. The Wilcoxon signed rank test gives a p-value 0.01 < p ≤ 0.05, which demonstrates that FDTW achieved a significant reduction of the generalization error of IDDTW. Results also show that FDTW managed to find the minimum error for nine datasets (Coffee, ECGFiveDays, Gun-point, ItalyPowerDemand, OliveOil, Plane, Synthetic control, Trace, Two patterns) and outperforms the smallest classification error reported in the literature on the CBF dataset (No. 5).

Heuristic execution speed:
As already suggested by the time complexity of the FDTW and IDDTW heuristics, IDDTW tests fewer candidates than FDTW and is therefore faster. However, the number of candidates tested by FDTW reduces exponentially with the length of the time series (Fig. 4). Indeed, the number of candidates to be tested ranges from 1 to n, n being the length of the time series, and FDTW considers √n candidates for each iteration. On average, FDTW is 8 times faster than Brute-force search, with an average execution time of 176 min against 1386 min for Brute-force search. IDDTW is seven times faster than FDTW and remains the fastest, with an average execution time of 24 min. The execution time increases with the length of the time series (Fig. 5). The execution time of Brute-force search increases faster than that of FDTW and IDDTW. This is observable from the Lightning-2 dataset onwards, whose time series have a length of 637 data points.
Note: The experiments were conducted on a PC with an Intel Core i7 processor, 16 GB of RAM and a Windows 7 64-bit operating system.

Comparison with other classification algorithms:
To evaluate the quality of FDTW, we compared its classification errors (generalization errors) with those of 35 other classification algorithms from the literature [2] on 84 datasets of the UCR archive. The performances of the algorithms are compared using the Nemenyi test, which compares all the algorithms pairwise and provides an intuitive way to visualize the results (Fig. 6). The Nemenyi test ranks classification algorithms according to their average accuracy on the 84 datasets. FDTW obtained good results on the simulated datasets in terms of average accuracy (3rd/37 algorithms, Fig. 6) because the data of the training set and the testing set are generated by the same models.
However, to evaluate the significance of the difference between the classification algorithms on 84 datasets, we used the Wilcoxon signed rank test with continuity correction, which has more statistical power. The results of these experiments show that despite data compression,
Table notes: In bold, the smallest generalization error between IDDTW and FDTW. N is the number of segments selected and l is the number of data points in a segment (l = n/N). The generalization error is computed on the testing set.
Table notes: Column groups: results reported in [1,5]; our results. DTW(r) is a constrained version of DTW in which the number of consecutive data points that can be compared to a single point during the warping is bounded; r represents the size of the warping window.
These results demonstrate the competitiveness of FDTW. Moreover, this algorithm outperforms the best result reported in the literature on the UWaveGestureLibraryAll dataset (Fig. 7). The challenge with this dataset is to recognize the gesture made by a user from measurements made by accelerometers. As reported in [1], the best accuracy obtained on this dataset is 83.44% with the TSBF algorithm; FDTW outperforms this result and reaches an accuracy of 91.87%.
Additional experiments are available here [24].

Conclusion and perspective
This paper deals with the problem of choosing an appropriate number of segments to compress time series with PAA in order to improve the alignment with DTW. To this end, we proposed a parameter-free heuristic named FDTW, which approximates the optimal number of segments to use. The experiments showed that FDTW increased the quality of alignment of time series, especially on synthetic datasets, where DTW associated with PAA performed better than any other variant of DTW on a classification task and was ranked 3rd/37, behind the two ensemble classification algorithms COTE and EE. This algorithm reduces the storage space and the processing time of time series while increasing the quality of the alignment of DTW. As a perspective, the problem we have dealt with in this paper could be modeled as a multi-objective optimization problem where one objective function would be compression and the other the classification of time series.