Add Contrast Profile Tutorial

Open ken-maeda opened this issue 2 years ago • 38 comments

Pull Request Checklist

#465 reproducing the paper for tutorial.

Below is a simple checklist but please do not hesitate to ask for assistance!

  • [x] Fork, clone, and checkout the newest version of the code
  • [x] Create a new branch
  • [x] Make necessary code changes
  • [ ] Install black (i.e., python -m pip install black or conda install -c conda-forge black)
  • [ ] Install flake8 (i.e., python -m pip install flake8 or conda install -c conda-forge flake8)
  • [ ] Install pytest-cov (i.e., python -m pip install pytest-cov or conda install -c conda-forge pytest-cov)
  • [ ] Run black . in the root stumpy directory
  • [ ] Run flake8 . in the root stumpy directory
  • [ ] Run ./setup.sh && ./test.sh in the root stumpy directory
  • [ ] Reference a Github issue (and create one if one doesn't already exist)

ken-maeda avatar Mar 02 '23 10:03 ken-maeda

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Codecov Report

Patch coverage has no change and project coverage change: -0.13 :warning:

Comparison is base (a4bb1e1) 99.25% compared to head (5407968) 99.12%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #800      +/-   ##
==========================================
- Coverage   99.25%   99.12%   -0.13%     
==========================================
  Files          82       83       +1     
  Lines       13101    13898     +797     
==========================================
+ Hits        13003    13776     +773     
- Misses         98      122      +24     

see 50 files with indirect coverage changes

codecov-commenter avatar Mar 02 '23 10:03 codecov-commenter

@ken-maeda Thank you for this contribution. Please allow me some time to review

seanlaw avatar Mar 03 '23 13:03 seanlaw

View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-03-07T11:21:35Z ----------------------------------------------------------------

T(+) requires at least two behaviors.

Can you explain why it requires at least two behaviors?

Maybe "behaviors" isn't the right word and you mean "at least two instances of the positive case"?


seanlaw commented on 2023-03-07T11:21:36Z ----------------------------------------------------------------

Line #1.    ecg_df = pd.read_csv("14172m.csv", index_col=0)

Instead of repeating astype(float) so many times later, you should just do:

ecg_df = pd.read_csv("14172m.csv", index_col=0, usecols=[1]).astype(float) 

Also, it would be nice just to show or print out what is in ecg_df.head() . What does the dataframe look like?


seanlaw commented on 2023-03-07T11:21:37Z ----------------------------------------------------------------

Can we add some comments on what we are looking at? Why does the bottom look so much more regular and with a repeated pattern?


seanlaw commented on 2023-03-07T11:21:39Z ----------------------------------------------------------------

Line #1.    v_query = ecg_df.iloc[5930:5930+127, 1].values.astype(float)

Why is the window size 127 and not 128?


seanlaw commented on 2023-03-07T11:21:40Z ----------------------------------------------------------------

I don't understand the significance of this. Your point isn't clear. Where did v_query come from? Why should the reader care about that?

It would be useful to discuss the bottom plot (distance profile) and how to interpret it.

Why is it useful/important to show distance profile?


ken-maeda commented on 2023-03-07T12:08:54Z ----------------------------------------------------------------

v_query is a typical normal ECG query that appears everywhere in the dataset, so I picked one at random.

The purpose of this distance profile is to find the desired behavior simply by comparing v_query (a typical ECG signal) with the desired behavior (a rare signal). The assumption was that the rare signal would have the highest value in the distance profile, but that did not happen.

seanlaw commented on 2023-03-07T11:21:41Z ----------------------------------------------------------------

What is "plato"?


ken-maeda commented on 2023-03-07T12:10:04Z ----------------------------------------------------------------

I added more description to the Contrast Profile section:

The subsequence in 𝐓(+) corresponding to the highest point in the Contrast Profile is called the Plato.

ken-maeda avatar Mar 07 '23 12:03 ken-maeda

I appreciate your feedback. I fixed those.

ken-maeda avatar Mar 07 '23 12:03 ken-maeda

@ken-maeda I have a suggestion for you.

I think it is better to develop the notebook section by section. So, for each section that you add, you can wait to get some feedback and then apply those, and then after getting the green light, you can move forward. Right now, you may know what is going on in your notebook, however, the main goal is to make sure the reader can understand what is going on! You can keep the current notebook somewhere in your local pc. Then, you can start again by just providing the first section or a couple of sections. So, your notebook should only contain a couple of parts in the beginning. Then, you can add sections to it step by step.

Currently, if I do not understand a part of your notebook, I try to read other parts to better understand the concept. However, this is not desirable. I think this is a red flag. There should be a flow in your tutorial and I believe each segment should be understandable on its own.

Also, the text is as important as the code. In fact, I think it is more important particularly in tutorials. So, try to be extra careful when you explain a concept. You want to be crystal clear in every single step as much as possible.

NimaSarajpoor avatar Mar 08 '23 02:03 NimaSarajpoor

Regarding contrast profile, this is how I see it:

We can see each subsequence of length m as a data point in $R^{m}$ space. For the sake of visualization, let's illustrate the problem in 2D space.

First, let's review the definitions of T(+) and T(-).

T(+) : contains at least two instances that are unique to the phenomena of interest.
T(-) : contains no instances of interest (and instead, I think we should say it contains the regular, obvious patterns in `T(+)`)

Before we talk about T(-) and T(+), it is better to just talk about T. Let's assume the figure below shows the subsequences of T.

image

If I look at this data, I can see that the regular, obvious pattern is where the crowded part is. However, the motif we might be interested in can be the motif pair (A, B). Note that this motif pair may not be easily captured as their distance is greater than the distance of any other point and its nearest neighbour.

So, what can we do? We can create T(-) which just contains the regular behaviour of our Data. We then use T(+) to denote the remaining ones.

image

Now, we can see that d = dist(A, A_nn_in_Tneg) - dist(A, A_nn_in_Tpos) has a high value, where A_nn_in_Tneg and A_nn_in_Tpos are the nearest neighbours of A in T(-) and T(+), respectively. Let's call d the contrast distance. The contrast profile, cp, is an array where cp[i] is the contrast distance that corresponds to the i-th subsequence in T(+). The peak of this contrast profile can reveal the motif pair (A, B).
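
To make the contrast distance concrete, here is a minimal sketch of the 2D picture described above. It is not stumpy code: the points, the helper `nn_dist`, and the variable names are all hypothetical, and each 2D point merely stands in for one subsequence of length m.

```python
import math


def nn_dist(p, points, skip=None):
    """Distance from p to its nearest neighbour in points, ignoring index skip."""
    return min(math.dist(p, q) for j, q in enumerate(points) if j != skip)


# Hypothetical toy data: each 2D point stands in for one length-m subsequence.
A, B = (3.0, 3.0), (3.5, 3.0)  # the rare pair of interest
T_neg = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2)]  # regular behaviour only
T_pos = [(0.1, 0.1), (0.1, 0.0), (0.0, 0.1), A, B]  # regular behaviour plus (A, B)

# cp[i] = dist(point i, its nn in T(-)) - dist(point i, its nn in T(+)),
# where the nearest neighbour in T(+) excludes the point itself.
cp = [nn_dist(p, T_neg) - nn_dist(p, T_pos, skip=i) for i, p in enumerate(T_pos)]

peak = max(range(len(cp)), key=cp.__getitem__)
print([round(c, 2) for c in cp])
print("peak of the contrast profile:", T_pos[peak])
```

With these made-up points, the regular points get a contrast distance close to zero (they have a close neighbour in both T(-) and T(+)), while A and B get a large one (far from everything in T(-), close to each other in T(+)), so the peak falls on the rare pair.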


Question: Can we see it as twin-freak problem? In other words, this might be an anomaly that appears more than once. So, we can easily find the motif pair (A, B) by finding the subsequence that has the greatest distance to its 2nd nearest neighbour.

Answer: I do not know! I think a good way to investigate this is to get some data and find a pair of subsets using each of these two approaches: (1) twin-freak (2) contrast profile, and see if they result in different outcomes.
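
As a starting point for that comparison, the twin-freak idea can be sketched on the same kind of 2D toy data. This is only an illustration under assumptions: the points and the helper `kth_nn_dist` are hypothetical, and agreement on a toy example says nothing about whether the two approaches agree on real data.

```python
import math


def kth_nn_dist(i, points, k):
    """Distance from points[i] to its k-th nearest neighbour (self excluded)."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[k - 1]


# Hypothetical toy data: a dense regular cluster plus the rare pair (A, B).
T = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
     (3.0, 3.0), (3.5, 3.0)]  # the last two points are A and B

# Twin-freak: look past the twin by using the 2nd nearest neighbour.
second_nn = [kth_nn_dist(i, T, k=2) for i in range(len(T))]
twin_freak = max(range(len(T)), key=second_nn.__getitem__)
print("largest 2nd-nearest-neighbour distance at:", T[twin_freak])
```

Unlike the contrast profile, this variant needs no separate T(-), but it does assume that the pattern of interest occurs exactly twice; a third occurrence would require k=3.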

NimaSarajpoor avatar Mar 08 '23 03:03 NimaSarajpoor

@NimaSarajpoor I'm sorry for causing trouble, and I'm grateful for your kind guidance. I uploaded the first section of a new notebook, Tutorial_Contrast_Profile2.ipynb. I should have considered the contrast profile concept more carefully.

I added a scatter plot to the notebook, which should be the best way to understand the "contrast" concept.

T(+) : contains at least two instances that are unique to the phenomena of interest.
T(-) : contains no instances of interest (and instead, I think we should say it contains the regular, obvious patterns in T(+))

This might be a tricky precondition, and the robustness to this precondition is also debated. I thought it would be enough for the notebook to explain only the contrast profile concept.

ken-maeda avatar Mar 08 '23 04:03 ken-maeda

@ken-maeda

I'm sorry for causing trouble, and I'm grateful for your kind guidance.

No need to be sorry. I provided a few comments. Let's start with those. Please do not add any new section. Let's take care of the current sections first. Please feel free to discuss something if you feel there is a need for that.

NimaSarajpoor avatar Mar 09 '23 05:03 NimaSarajpoor

@ken-maeda Instead of uploading a .png file, I would prefer if you could add the code that could create/recreate the image and have it inside the notebook

seanlaw avatar Mar 10 '23 01:03 seanlaw

@ken-maeda I haven't reviewed the whole notebook yet. Let's start with the current comments and some of the previous comments if you haven't addressed them yet.

Again, if you are in doubt, or you think there is a need to discuss something, please feel free to discuss it. When you are done, please let me know.


Also: Please pull the latest changes to your branch.

NimaSarajpoor avatar Mar 12 '23 04:03 NimaSarajpoor

I fixed the typos and variable names. Then I rearranged the order of the explanation in the introduction.

ken-maeda avatar Mar 12 '23 08:03 ken-maeda

@ken-maeda I reviewed up to the section Loading the ECG data for Contrast Profile. Whenever you address a comment, you can go to ReviewNB (see the top of this PR) and then click on "Resolve Conversation" whenever you are done with that comment.


I think you have done a great job so far, and the notebook becomes more and more clear.

NimaSarajpoor avatar Mar 24 '23 00:03 NimaSarajpoor

@NimaSarajpoor I appreciate your kind feedback. I updated the notebook markdown.

ken-maeda avatar Mar 27 '23 11:03 ken-maeda

@ken-maeda From my point of view, things are good so far. My only comment concerns the last figure, as I think it is a little bit crowded (and I am not sure whether there is a better way to break it down). Also, you may want to revise the last sentence. I think the last sentence talks about the desirable patterns, but what you should have discussed is that the discords discovered via the matrix profile (shown in "orange") are not desirable, or something like that.

NimaSarajpoor avatar Apr 12 '23 04:04 NimaSarajpoor

@NimaSarajpoor I split the last plot into separate plots (maybe redundant?). As you mentioned, it was crowded, and it was hard to recognize what was being indicated. I also fixed the last sentence.

ken-maeda avatar Apr 20 '23 03:04 ken-maeda

@ken-maeda I provided my final touch on the notebook. After addressing the comments, we should see what @seanlaw thinks about this notebook.

If everything is okay, we can then move forward and add the part where you compute the contrast profile and show that it can discover the subsequences of our interest.

@ken-maeda How do you feel about the progress so far?

NimaSarajpoor avatar Apr 23 '23 00:04 NimaSarajpoor

@NimaSarajpoor I'm sorry for the delay. I have fixed the points you mentioned in the notebook. I hope the notebook is fine now.

ken-maeda avatar May 14 '23 13:05 ken-maeda

@ken-maeda Thank you for addressing the comments. While there is still some room for improvement, we can do it later. I think @seanlaw can take a look at the notebook now and see if he has any opinion / suggestion.

@seanlaw Do you have any comment on the second notebook, docs/Tutorial_Contrast_Profile2.ipynb ?

NimaSarajpoor avatar May 15 '23 01:05 NimaSarajpoor

Let me find some time to provide some comments

seanlaw avatar May 15 '23 10:05 seanlaw

Apply stump to the entire time series

what is the simplest thing that the user would have tried?

Regarding the point whether it is natural to calculate the Matrix Profile for the entire series first, the characteristic we are trying to find this time is neither motif nor discord, so I feel there is no motivation to calculate stump directly. Therefore, even if you simply calculate it for the entire series, I don't think there is much to say from the event itself, so I compared when there is a single discord and when there are multiple discords, and built on what can be said from there. It might be better to make it easy to understand what I'm trying to do from the very beginning.

ken-maeda avatar May 21 '23 11:05 ken-maeda

Regarding the point whether it is natural to calculate the Matrix Profile for the entire series first, the characteristic we are trying to find this time is neither motif nor discord, so I feel there is no motivation to calculate stump directly. Therefore, even if you simply calculate it for the entire series, I don't think there is much to say from the event itself, so I compared when there is a single discord and when there are multiple discords, and built on what can be said from there. It might be better to make it easy to understand what I'm trying to do from the very beginning.

I think the point is that stump will not be able to help you here precisely because the subsequence is neither a top motif nor a top discord. Certainly, if you traverse down to the top-N motifs then you might eventually find it. I think it is important to motivate "why" computing the full matrix profile is not enough and also demonstrate its ineffectiveness for this particular problem (i.e., when the subsequence of interest does not have a nearest neighbor that is as close as other motifs)

seanlaw avatar May 21 '23 17:05 seanlaw

I think it is important to motivate "why" computing the full matrix profile is not enough and also demonstrate its ineffectiveness for this particular problem

  1. It is challenging to set a goal prior to calculating the Matrix Profile across the whole data. When we have a signal pattern we want to find, searching for motifs or discords with the Matrix Profile may not seem natural. Users might question what to do in such cases and may find it unnatural to take action using the naive Matrix Profile.

  2. Analyzing the result of applying the Matrix Profile to the whole data is difficult. As you mentioned, whether we can find what we're looking for largely depends on parts of the signal other than the current characteristic. Therefore, it's hard to say from the results what would be better from the perspective of the current characteristic.

  3. I want to decide which elements must be explained in the introduction and which ones merely can be. Currently, the overall flow is:
     3-1. If a discord is included once, it can be found.
     3-2. If a discord is included twice, it cannot be found.
     3-3. So, what should we do?
     Regarding this flow, I thought it would be better to state more concisely at the beginning what the Contrast Profile brings. However, what do you think should be written in the introduction?
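
The flow in 3-1 and 3-2 can be sketched on synthetic data. This is a rough illustration and not the notebook's code: it does not call stumpy, the series and the helper `nn_profile` are hypothetical, and it uses plain Euclidean distance with an exclusion zone of m instead of a z-normalized matrix profile.

```python
import math


def nn_profile(T, m):
    """Brute-force nearest-neighbour distance for every length-m window of T.

    Simplifications (assumptions of this sketch): plain Euclidean distance
    rather than z-normalized distance, and an exclusion zone of m so that
    overlapping windows are never matched to each other.
    """
    windows = [T[i:i + m] for i in range(len(T) - m + 1)]
    return [
        min(math.dist(w, v) for j, v in enumerate(windows) if abs(i - j) >= m)
        for i, w in enumerate(windows)
    ]


m = 4
regular = [0, 1, 0, -1] * 5  # hypothetical regular pattern
rare = [5, -5, 5, -5]        # hypothetical rare pattern

once = regular + rare + regular
twice = regular + rare + regular + rare + regular
start = len(regular)         # index where the first rare window begins

p_once = nn_profile(once, m)
p_twice = nn_profile(twice, m)

print("once : top discord at index", p_once.index(max(p_once)))
print("twice: profile value at the rare window:", p_twice[start])
print("twice: largest profile value:", max(p_twice))
```

When the rare pattern appears once, its window has the largest nearest-neighbour distance and is reported as the top discord. When it appears twice, each occurrence finds its twin, so its profile value drops to the same level as the regular windows and it no longer stands out.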

ken-maeda avatar May 22 '23 16:05 ken-maeda