
I have the following data set (a pandas.DataFrame) that I would like to fit with scipy.interpolate.UnivariateSpline. Let's call it data.

Date
2018-04-02 09:00:00     16249
2018-04-02 10:00:00     45473
2018-04-02 11:00:00     32050
2018-04-02 12:00:00     35898
2018-04-02 13:00:00     21577
2018-04-02 14:00:00     30545
2018-04-02 15:00:00     60925
2018-04-02 16:00:00     47124
2018-04-03 09:00:00     18534
2018-04-03 10:00:00     36064
2018-04-03 11:00:00     32387
2018-04-03 12:00:00     15903
2018-04-03 13:00:00     22291
2018-04-03 14:00:00     26367
2018-04-03 15:00:00     66269
2018-04-03 16:00:00     38478
2018-04-04 09:00:00     15803
2018-04-04 10:00:00     22511
2018-04-04 11:00:00     33123
2018-04-04 12:00:00     21000
2018-04-04 13:00:00     23132
2018-04-04 14:00:00     39270
2018-04-04 15:00:00    102544
2018-04-04 16:00:00    143421
2018-04-04 17:00:00       200
2018-04-05 09:00:00     23377
2018-04-05 10:00:00     52089
2018-04-05 11:00:00     99298
2018-04-05 12:00:00     24627
2018-04-05 13:00:00     33467
2018-04-05 14:00:00     26498
2018-04-05 15:00:00    114794
2018-04-05 16:00:00     44904
2018-04-06 09:00:00     12180
2018-04-06 10:00:00     41658
2018-04-06 11:00:00     64066
2018-04-06 12:00:00     12517
2018-04-06 13:00:00     12610
2018-04-06 14:00:00     43544
2018-04-06 15:00:00     65533
2018-04-06 16:00:00    123885
2018-04-09 09:00:00     13425
2018-04-09 10:00:00     38354
2018-04-09 11:00:00     59491
2018-04-09 12:00:00     21402
2018-04-09 13:00:00     24550
2018-04-09 14:00:00     25189
2018-04-09 15:00:00     67751
2018-04-09 16:00:00     16071
2018-04-10 09:00:00     35587
2018-04-10 10:00:00     58667
2018-04-10 11:00:00     41831
2018-04-10 12:00:00     35196
2018-04-10 13:00:00     22611
2018-04-10 14:00:00     23070
2018-04-10 15:00:00     40819
2018-04-10 16:00:00     20337
2018-04-11 09:00:00      7962
2018-04-11 10:00:00     23982
2018-04-11 11:00:00     21794
2018-04-11 12:00:00     16835
2018-04-11 13:00:00     16821
2018-04-11 14:00:00     13270
2018-04-11 15:00:00     34954
2018-04-11 16:00:00     15772
2018-04-12 09:00:00      8587
2018-04-12 10:00:00     47950
2018-04-12 11:00:00     24742
2018-04-12 12:00:00     16743
2018-04-12 13:00:00     21917
2018-04-12 14:00:00     43272
2018-04-12 15:00:00     50630
2018-04-12 16:00:00    104656
2018-04-13 09:00:00     15282
2018-04-13 10:00:00     30304
2018-04-13 11:00:00     65737
2018-04-13 12:00:00     17467
2018-04-13 13:00:00     10439
2018-04-13 14:00:00     19836
2018-04-13 15:00:00     52051
2018-04-13 16:00:00     99462

What I have done so far:

import matplotlib.pyplot as plt
import numpy as np
import scipy.interpolate as interp

x = np.arange(1, data.size + 1)  # x runs from 1 to data.size (81 for the data above)

spl = interp.UnivariateSpline(x, data.values, s=0.5)
xx = np.linspace(min(x), max(x), 1000)  # 1000 is an arbitrary number here.
plt.plot(x, data.values, 'bo')
plt.plot(xx, spl(xx), 'r')
plt.show()

# The plot is below. It looks very linear and not like a cubic spline at all, even though a cubic spline (k=3) is the default.

[plot: data points (blue) with the spline evaluated on xx (red)]

When I evaluate spl at x instead, with everything else unchanged:

plt.plot(x, spl(x), 'r')

I get the following:

The only difference is that the y axis tops out at 14,000, which seems to suggest that the previous plot showed some degree of curvature (or not?).

[plot: data points (blue) with the spline evaluated at x (red)]

I am not sure what I am missing here, but I am clearly missing something. I am still very new to spline fitting in Python.

Can you tell me how to correctly fit a spline to the time series above?

EDIT

Following your comments, I wanted to add another plot to hopefully explain myself a bit better. I didn't really mean that it is linear, but I couldn't find a better word. To illustrate:

xxx = [10, 20, 40, 60, 80]
plt.plot(x, data.values, 'bo')
plt.plot(xxx, spl(xxx), 'r')  # evaluate the spline at only five points

plt.show()

I think the plot below looks reasonably linear-ish in my sense. I am guessing my question should really be: how does scipy.interpolate.UnivariateSpline actually work?

Does it only draw the curve through the values evaluated at the points we supply (for this plot, xxx)?

[plot: data points (blue) with the spline evaluated at the five points in xxx (red)]

I was expecting a much smoother plot with visible curvature. This question's answer shows the kind of plot I would expect; it looks like something piecewise cubic functions would generate, whereas mine looks, to me and compared to that plot, linear-ish (or first-order, if that is the more appropriate term).

stucash
  • Why do you say the plot looks linear? In the first plot, zoom in on the interval [20 <= x <= 25]. Does that really look linear to you? – Warren Weckesser Jul 13 '18 at 00:06
  • The spline fit looks correct and is performed correctly. I don't quite understand the problem with it. So when you say "I apparently missed something", what do you mean? What is wrong with the plot and how would you like it to look instead? – ImportanceOfBeingErnest Jul 13 '18 at 00:17
  • @WarrenWeckesser sorry for the inaccuracy. I didn't really intend to say 'linear'. I have added another plot to help explain my question. – stucash Jul 13 '18 at 13:06
  • @ImportanceOfBeingErnest thanks for your time. I probably was expecting the wrong thing from `UnivariateSpline`. I have added a link to a plot that was more like what I wanted. I typically use `r` for spline when I use `gam` to do spline in r, it looks correct and is what I expected as well. – stucash Jul 13 '18 at 13:14

1 Answer


The data set you have looks more like Rexthor, the dog-bearer than something that a smooth curve can follow. You don't have an issue with SciPy; you have an issue with data.

By increasing the parameter s you can get progressively smoother plots that deviate further and further from the data, eventually approaching the cubic polynomial that is the "best" least-squares fit for the data. But here "best" means "very bad, probably worthless". A smooth curve can be useful to display a pattern that the data already follows. If the data does not follow a smooth pattern, one should not draw a curve for the sake of drawing. The data points on the first plot should just be presented as is, without any connecting or approximating curves.

The data comes from hourly readings taken from 9:00 to 16:00 (with one stray 17:00 value mixed in; throw it out). This structure matters. Do not pretend that Tuesday 9:00 is what happens one hour after Monday 16:00.
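To make the overnight gap concrete, here is a minimal sketch (not part of the original answer) comparing the consecutive-integer x used in the question with real elapsed time. It assumes the data carries a DatetimeIndex; `sample_index` and the other names are hypothetical stand-ins, built from the first two days of timestamps only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.index: the first two days of timestamps.
sample_index = pd.to_datetime(
    [f"2018-04-02 {h:02d}:00" for h in range(9, 17)]
    + [f"2018-04-03 {h:02d}:00" for h in range(9, 17)]
)

# x as used in the question: consecutive integers, one per reading.
x_consecutive = np.arange(1, len(sample_index) + 1)

# x as real elapsed time, in hours since the first reading.
x_elapsed = ((sample_index - sample_index[0]) / pd.Timedelta(hours=1)).to_numpy()

# Step from Monday 16:00 to Tuesday 9:00 under each choice of x.
step_consecutive = x_consecutive[8] - x_consecutive[7]
step_elapsed = x_elapsed[8] - x_elapsed[7]
print(step_consecutive, step_elapsed)
```

With consecutive integers the overnight gap counts as a single step; with elapsed time it is 17 hours. Neither choice turns the readings into a smooth series, so this only illustrates what the x axis is implicitly claiming.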

The data can be meaningfully summarized by daily totals

Day         Total
2018-04-02  289841
2018-04-03  256293
2018-04-04  401004
2018-04-05  419054
2018-04-06  375993
2018-04-09  266233
2018-04-10  278118
2018-04-11  151390
2018-04-12  318497
2018-04-13  310578

and by hourly averages (average number of events at 9:00, across all days, etc).

Hour        Average
9:00:00     16698.6
10:00:00    39705.2
11:00:00    47451.9
12:00:00    21758.8
13:00:00    20941.5
14:00:00    29086.1
15:00:00    65627
16:00:00    65411
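For reference, summaries like the two tables above could be computed along the following lines. This is only a sketch: it assumes the readings are in a pandas Series with a DatetimeIndex, uses just the first two days of the question's data, and the name `sample` is made up for illustration.

```python
import pandas as pd

# First two days of the question's data, as a Series with a DatetimeIndex.
index = pd.to_datetime(
    [f"2018-04-02 {h:02d}:00" for h in range(9, 17)]
    + [f"2018-04-03 {h:02d}:00" for h in range(9, 17)]
)
values = [16249, 45473, 32050, 35898, 21577, 30545, 60925, 47124,
          18534, 36064, 32387, 15903, 22291, 26367, 66269, 38478]
sample = pd.Series(values, index=index, name="count")

# Keep only the regular 9:00-16:00 readings (this would drop a stray 17:00 row).
sample = sample[(sample.index.hour >= 9) & (sample.index.hour <= 16)]

# One total per calendar day, and one average per hour of day across days.
daily_totals = sample.groupby(sample.index.normalize()).sum()
hourly_averages = sample.groupby(sample.index.hour).mean()

print(daily_totals)
print(hourly_averages)
```

On this two-day subset the daily totals agree with the first two rows of the daily table; the hourly averages differ from the hourly table because they cover two days rather than ten.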

In these summaries we may be able to observe a pattern. Here is the hourly one:

import matplotlib.pyplot as plt
import numpy as np
import scipy.interpolate as interp

hourly_averages = np.array([16698.6, 39705.2, 47451.9, 21758.8, 20941.5, 29086.1, 65627, 65411])
hours = np.arange(9, 17)
hourly_s = 0.1*np.diff(hourly_averages).max()**2
hourly_spline = interp.UnivariateSpline(hours, hourly_averages, s=hourly_s)
xx = np.linspace(min(hours), max(hours), 1000)  # 1000 is an arbitrary number here.
plt.plot(hours, hourly_averages, 'bo')
plt.plot(xx, hourly_spline(xx), 'r')
plt.show()

[plot: hourly averages with the fitted spline]

The curve shows the lunch break and the end-of-day rush. My choice of s as 0.1*np.diff(hourly_averages).max()**2 is not canonical, but it recognizes the fact that s scales as the square of the residuals. (Documentation). I'll use the same choice for the daily totals:

daily_totals = np.array([289841, 256293, 401004, 419054, 375993, 266233, 278118, 151390, 318497, 310578])
days = np.arange(len(daily_totals))
daily_s = 0.1*np.diff(daily_totals).max()**2
daily_spline = interp.UnivariateSpline(days, daily_totals, s=daily_s)
xx = np.linspace(min(days), max(days), 1000)  # 1000 is an arbitrary number here.
plt.plot(days, daily_totals, 'bo')
plt.plot(xx, daily_spline(xx), 'r')
plt.show()

[plot: daily totals with the fitted spline]

This is less useful. Maybe we need a longer period of observations. Maybe we should not pretend that Monday comes after Friday. Maybe averages should be taken for each day of week to uncover a weekly pattern, but with only two weeks there is not enough to play with.


Technical details: UnivariateSpline chooses as few knots as possible so that a certain weighted sum of squared deviations from the data is at most s. With large s this means very few knots, until no interior knots remain and we get a single cubic polynomial. How large s needs to be depends on the amount of oscillation in the vertical direction, which is extremely high in this example.
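The effect of s on the knots can be checked directly. The following is a rough sketch using the hourly averages from above; the particular s values are arbitrary choices for illustration, and `results` is just a helper dictionary.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

hours = np.arange(9, 17)
hourly_averages = np.array([16698.6, 39705.2, 47451.9, 21758.8,
                            20941.5, 29086.1, 65627, 65411])

results = {}
for s in (0, 0.5, 1e8, 1e12):
    spline = UnivariateSpline(hours, hourly_averages, s=s)
    # get_knots() includes the two boundary knots; get_residual() is the
    # weighted sum of squared deviations that gets compared against s.
    results[s] = (len(spline.get_knots()), spline.get_residual())
    print(f"s={s:g}: knots={results[s][0]}, residual={results[s][1]:.3g}")
```

Since the values here deviate from any smooth curve by roughly 1e4, the squared deviations are on the order of 1e8 or more. An s of 0.5 (or 100, or 500) is negligible on that scale and behaves almost like s=0, that is, interpolation through every point. Only when s is comparable to those squared deviations do knots start to disappear, and for a sufficiently large s only the two boundary knots remain.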

  • thanks for your answer. I actually have tried increasing `s` to some ridiculous numbers, e.g. `s=100`, `s=500`. I guess I have to zoom in to see the actual smoothing effect. – stucash Jul 13 '18 at 14:01
  • what about the last plot I added? that looks like `UnivariateSpline` only evaluated values at those points supplied and simply drew straight lines to connect them all. – stucash Jul 13 '18 at 14:06
  • hmm.. when I read [this answer](https://stackoverflow.com/questions/17913330/fitting-data-using-univariatespline-in-scipy-python), where only 14 points were passed to `UnivariateSpline`, the plot looks not too bad though? only difference is mine is using `pyplot.plot`, that answer is `pylab.plot`. – stucash Jul 13 '18 at 14:23
  • Just to make sure everyone is on the same page: If you have 7 points and plot a sine curve, `x=np.array([0,1,2,3,4,5,6]); y=np.sin(x); plot(x,y)` it [does not look smooth](https://i.stack.imgur.com/LZpqX.png), simply because you sample the sine with 7 points only. If you use 1000 points instead to sample the sine, `x = np.linspace(0,6,1000)`, you will of course get [a smooth result](https://i.stack.imgur.com/jO4JV.png). – ImportanceOfBeingErnest Jul 13 '18 at 14:39
  • @ImportanceOfBeingErnest so it does look like it is the plotting that I should play more with. I'll give it a few more tries shortly. thanks a lot! – stucash Jul 13 '18 at 14:47
  • @stucash The plot with 1000 points (the first posted in the question) has enough points for an accurate plot. If it doesn't look smooth enough to you (it does to me, _considering the source data_), increasing the number of points further won't help. Except for that 5-point plot added in the edit, you plotted things correctly. –  Jul 13 '18 at 14:52
  • @user6655984 what bothers me is the first plot looks a bit overfitted to me, given it goes through every single point I plotted.. I tried using 60 points with `UnivariateSpline`, it still has kinks, probably that is not enough. – stucash Jul 13 '18 at 14:58
  • yes I agree with what you said; I admit that I didn't think hard enough before I tried to fit the data to a curve. To be honest I am still very new to time series I surely need to learn more to do it properly. The plot was like this because I didn't use the `DateTimeIndex` for `xAxis` such that I could ignore the gaps in time. But from what you said that caused problem. – stucash Jul 13 '18 at 15:14
  • Also mind that a UnivariateSpline is in general **not a fit**. Only due to the smoothing factor *s*, it becomes a fit. If `s=0`, the spline goes through every point. If you force a cubic polynomial to go through every point, it will inevitably overshoot. – ImportanceOfBeingErnest Jul 13 '18 at 15:18
  • Could you show me what should be the right direction here? what do I need to do so the plot makes sense? – stucash Jul 13 '18 at 15:18
  • and why does ignoring gap in time cause this problem? I would actually want them to be treated like 9AM comes right after 4PM because that is the period in a day I want to analyse and data only available during those hours. i.e. irregular time series. if you could elaborate on your previous point, it's much appreciated. – stucash Jul 13 '18 at 15:30
  • @stucash I expanded the answer. –  Jul 13 '18 at 16:45