Add Contrast Profile Tutorial
Pull Request Checklist
#465 reproducing the paper for tutorial.
Below is a simple checklist but please do not hesitate to ask for assistance!
- [x] Fork, clone, and checkout the newest version of the code
- [x] Create a new branch
- [x] Make necessary code changes
- [ ] Install `black` (i.e., `python -m pip install black` or `conda install -c conda-forge black`)
- [ ] Install `flake8` (i.e., `python -m pip install flake8` or `conda install -c conda-forge flake8`)
- [ ] Install `pytest-cov` (i.e., `python -m pip install pytest-cov` or `conda install -c conda-forge pytest-cov`)
- [ ] Run `black .` in the root stumpy directory
- [ ] Run `flake8 .` in the root stumpy directory
- [ ] Run `./setup.sh && ./test.sh` in the root stumpy directory
- [ ] Reference a Github issue (and create one if one doesn't already exist)
Check out this pull request on ReviewNB
See visual diffs & provide feedback on Jupyter Notebooks.
Powered by ReviewNB
Codecov Report
Patch coverage has no change and project coverage change: -0.13 :warning:
Comparison is base (a4bb1e1) 99.25% compared to head (5407968) 99.12%.
Additional details and impacted files
@@ Coverage Diff @@
## main #800 +/- ##
==========================================
- Coverage 99.25% 99.12% -0.13%
==========================================
Files 82 83 +1
Lines 13101 13898 +797
==========================================
+ Hits 13003 13776 +773
- Misses 98 122 +24
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
@ken-maeda Thank you for this contribution. Please allow me some time to review
View / edit / reply to this conversation on ReviewNB
seanlaw commented on 2023-03-07T11:21:35Z ----------------------------------------------------------------
T(+) requires at least two behaviors.
Can you explain why it requires at least two behaviors?
Maybe "behaviors" isn't the right word and you mean "at least two instances of the positive case"?
seanlaw commented on 2023-03-07T11:21:36Z ----------------------------------------------------------------
Line #1. ecg_df = pd.read_csv("14172m.csv", index_col=0)
Instead of repeating astype(float) so many times later, you should just do:
ecg_df = pd.read_csv("14172m.csv", index_col=0, usecols=[1]).astype(float)
Also, it would be nice to show or print the output of `ecg_df.head()`. What does the dataframe look like?
seanlaw commented on 2023-03-07T11:21:37Z ----------------------------------------------------------------
Can we add some comments on what we are looking at? Why does the bottom look so much more regular and with a repeated pattern?
seanlaw commented on 2023-03-07T11:21:39Z ----------------------------------------------------------------
Line #1. v_query = ecg_df.iloc[5930:5930+127, 1].values.astype(float)
Why is the window size 127 and not 128?
seanlaw commented on 2023-03-07T11:21:40Z ----------------------------------------------------------------
I don't understand the significance of this. Your point isn't clear. Where did v_query come from? Why should the reader care about that?
It would be useful to discuss the bottom plot (distance profile) and how to interpret it.
Why is it useful/important to show distance profile?
ken-maeda commented on 2023-03-07T12:08:54Z ----------------------------------------------------------------
v_query is a typical normal ECG query that we can see everywhere in the dataset, so one was picked at random.
The purpose of this distance profile is to find the desired behavior just by comparing v_query (a typical ECG signal) with the desired behavior (a rare signal). The assumption was that the desired behavior would have the highest value in the distance profile, but that didn't happen.
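For readers following the thread, here is a minimal sketch of what a distance profile computes: the z-normalized Euclidean distance between one query and every window of a series. The toy data, `znorm`, and `distance_profile` are hypothetical names written for this illustration, not code from the notebook. In STUMPY the same quantity would presumably come from `stumpy.mass(Q, T)`.

```python
import math


def znorm(values):
    """Z-normalize a window; a constant window maps to all zeros."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def distance_profile(query, series):
    """Z-normalized Euclidean distance from `query` to every window of `series`."""
    m = len(query)
    q = znorm(query)
    profile = []
    for i in range(len(series) - m + 1):
        w = znorm(series[i : i + m])
        profile.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(q, w))))
    return profile


# Hypothetical toy data (assumption): a repeating "normal" beat with one rare shape.
normal_beat = [0.0, 1.0, 0.0, -1.0]
rare_shape = [1.0, 1.0, -1.0, -1.0]
series = normal_beat * 3 + rare_shape + normal_beat * 3

profile = distance_profile(normal_beat, series)
closest = min(range(len(profile)), key=profile.__getitem__)
print(f"windows: {len(profile)}, closest match starts at index {closest}")
```

Low values mark windows that look like the query and high values mark windows that do not. Whether the rare behavior ends up as the single highest value depends on everything else in the series, which is the difficulty described above.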
seanlaw commented on 2023-03-07T11:21:41Z ----------------------------------------------------------------
What is "plato"?
ken-maeda commented on 2023-03-07T12:10:04Z ----------------------------------------------------------------
I added more description to
Contrast Profile
The subsequence in 𝐓(+) corresponding to the highest point in the Contrast Profile is called the Plato.
I appreciate your feedback; I fixed those.
@ken-maeda I have a suggestion for you.
I think it is better to develop the notebook section by section. So, for each section that you add, you can wait to get some feedback and then apply those, and then after getting the green light, you can move forward. Right now, you may know what is going on in your notebook, however, the main goal is to make sure the reader can understand what is going on! You can keep the current notebook somewhere in your local pc. Then, you can start again by just providing the first section or a couple of sections. So, your notebook should only contain a couple of parts in the beginning. Then, you can add sections to it step by step.
Currently, if I do not understand a part of your notebook, I try to read other parts to better understand the concept. However, this is not desirable. I think this is a red flag. There should be a flow in your tutorial and I believe each segment should be understandable on its own.
Also, the text is as important as the code. In fact, I think it is more important particularly in tutorials. So, try to be extra careful when you explain a concept. You want to be crystal clear in every single step as much as possible.
Regarding contrast profile, this is how I see it:
We can see each subsequence of length m as a data point in $R^{m}$ space. For the sake of visualization, let's illustrate the problem in 2D space.
First, let's review the definitions of T(+) and T(-).
T(+) : contains at least two instances that are unique to the phenomena of interest.
T(-) : contains no instances of interest (and instead, I think we should say it contains the regular, obvious patterns in `T(+)`)
Before we talk about T(-) and T(+), it is better to just talk about T. Let's assume the figure below shows the subsequences of T.
If I look at this data, I can see that the regular, obvious pattern is where the crowded part is. However, the motif we might be interested in can be the motif pair (A, B). Note that this motif pair may not be easily captured as their distance is greater than the distance of any other point and its nearest neighbour.
So, what can we do? We can create T(-) which just contains the regular behaviour of our Data. We then use T(+) to denote the remaining ones.
Now, we can see that the d = dist(A, A_nn_in_Tneg) - dist(A, A_nn_in_Tpos) has a high value. Let's call it contrast distance. The contrast profile, cp, is an array where cp[i] is the contrast distance that corresponds to the i-th subsequence in T(+). The peak of this contrast profile can reveal the motif pair (A, B).
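The contrast distance defined above can be sketched on 2D points. This is only an illustration under stated assumptions: the coordinates are made up, each point stands in for one subsequence, and plain Euclidean distance replaces the z-normalized distance used on real time series. `nearest` is a hypothetical helper.

```python
import math

# Hypothetical 2D points standing in for subsequences (not notebook data).
T_neg = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]  # regular behaviour only
A, B = (3.0, 3.0), (3.5, 3.0)  # motif pair of interest
T_pos = [(0.05, 0.05), (0.12, 0.02), A, B]  # regular points plus the pair


def nearest(point, candidates):
    """Distance from `point` to its nearest neighbour among `candidates`."""
    return min(math.dist(point, c) for c in candidates)


# cp[i] = dist(p, nn in T(-)) - dist(p, nn in T(+)), for the i-th point of T(+)
contrast_profile = []
for i, p in enumerate(T_pos):
    others_in_pos = T_pos[:i] + T_pos[i + 1 :]  # exclude the point itself
    contrast_profile.append(nearest(p, T_neg) - nearest(p, others_in_pos))

peak = max(range(len(T_pos)), key=contrast_profile.__getitem__)
print(T_pos[peak], round(contrast_profile[peak], 3))
```

In this toy set the regular points have a contrast distance near zero, because they have close neighbours in both T(-) and T(+), while A and B are far from T(-) and close to each other.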
Question: Can we see it as a twin-freak problem? In other words, this might be an anomaly that appears more than once. If so, we can find the motif pair (A, B) by finding the subsequence that has the greatest distance to its 2nd nearest neighbour.
Answer: I do not know! I think a good way to investigate this is to get some data and find a pair of subsets using each of these two approaches: (1) twin-freak (2) contrast profile, and see if they result in different outcomes.
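As a starting point for that experiment, the twin-freak idea can be sketched on made-up 2D points. The coordinates and the helper `second_nn_distance` are assumptions for illustration only, and agreement on such a toy set says nothing about how the two approaches compare on real data.

```python
import math

# Hypothetical points: a crowded "regular" cluster plus the pair (A, B).
regular = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (0.12, 0.02)]
A, B = (3.0, 3.0), (3.5, 3.0)
T = regular + [A, B]


def second_nn_distance(i, points):
    """Distance from points[i] to its 2nd nearest neighbour (itself excluded)."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[1]


scores = [second_nn_distance(i, T) for i in range(len(T))]
twin_freak = max(range(len(T)), key=scores.__getitem__)
print(T[twin_freak], round(scores[twin_freak], 3))
```

Here the 2nd-nearest-neighbour score is small inside the cluster and large for A and B, since each has only the other as a close neighbour. Note that this approach uses a single set T, whereas the contrast profile needs the split into T(+) and T(-).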
@NimaSarajpoor
I'm sorry for causing trouble, and I'm grateful for your kind guidance. I uploaded a new notebook, Tutorial_Contrast_Profile2.ipynb, that contains only the first section. I should have thought more about the contrast profile concept.
I added a scatter plot to the notebook, which I think is the best way to understand the "contrast" concept.
T(+) : contains at least two instances that are unique to the phenomena of interest.
T(-) : contains no instances of interest (and instead, I think we should say it contains the regular, obvious patterns in T(+))
This might be a tricky precondition, and how robust the method is to this precondition is also a matter of discussion. I thought it would be enough for the notebook to explain only the contrast profile concept.
@ken-maeda
I'm sorry for causing trouble, and I'm grateful for your kind guidance.
No need to be sorry. I provided a few comments. Let's start with those. Please do not add any new section. Let's take care of the current sections first. Please feel free to discuss something if you feel there is a need for that.
@ken-maeda Instead of uploading a .png file, I would prefer if you could add the code that could create/recreate the image and have it inside the notebook
@ken-maeda I haven't reviewed the whole notebook yet. Let's start with the current comments and some of the previous comments if you haven't addressed them yet.
Again, if you are in doubt, or you think there is a need to discuss something, please feel free to discuss it. When you are done, please let me know.
Also: Please pull the latest changes to your branch.
I fixed these typos and variable names. I also rearranged the order of the explanation in the introduction.
@ken-maeda
I reviewed up to the section Loading the ECG data for Contrast Profile. Whenever you address a comment, you can go to ReviewNB (see the top of this PR) and then click on "Resolve Conversation" whenever you are done with that comment.
I think you have done a great job so far, and the notebook becomes more and more clear.
@NimaSarajpoor I appreciate your kind feedback; I updated the notebook markdown.
@ken-maeda From my point of view, things are good so far. My only comment is about the last figure, as I think it is a little bit crowded (and I am not sure whether there is a better way to break it down). Also, you may want to revise the last sentence. I think it talks about the desirable patterns, but what you should discuss is that the discords discovered via the matrix profile (shown in "orange") are not the desirable ones, or something like that.
@NimaSarajpoor I split the last plot into separate plots (maybe redundant?). As you mentioned, it was crowded and it was hard to recognize what was being indicated. I also fixed the last sentence.
@ken-maeda I provided my final touch on the notebook. After addressing the comments, we should see what @seanlaw thinks about this notebook.
If everything is okay, we can then move forward and add the part where you compute the contrast profile and show that it can discover the subsequences of our interest.
@ken-maeda How do you feel about the progress so far?
@NimaSarajpoor I'm sorry for the delay. I have fixed the notebook at the points you mentioned. I hope the notebook is fine now.
@ken-maeda Thank you for addressing the comments. While there is still some room for improvement, we can do it later. I think @seanlaw can take a look at the notebook now and see if he has any opinion / suggestion.
@seanlaw
Do you have any comment on the second notebook, docs/Tutorial_Contrast_Profile2.ipynb ?
Let me find some time to provide some comments
Apply stump to the entire time series. What is the simplest thing that the user would have tried?
Regarding whether it is natural to calculate the Matrix Profile for the entire series first: the characteristic we are trying to find this time is neither a motif nor a discord, so I feel there is no motivation to calculate stump directly. Even if you simply calculate it for the entire series, I don't think there is much to say from the result itself, so I compared the case with a single discord against the case with multiple discords and built on what can be said from there. It might be better to make it easy to understand what I'm trying to do from the very beginning.
I think the point is that stump will not be able to help you here precisely because the subsequence is neither a top motif nor a top discord. Certainly, if you traverse down to the top-N motifs then you might eventually find it. I think it is important to motivate "why" computing the full matrix profile is not enough and also demonstrate its ineffectiveness for this particular problem (i.e., when the subsequence of interest does not have a nearest neighbor that is as close as other motifs)
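One way to make that motivation concrete is a naive, pure-Python sketch that compares a self-join matrix profile of T(+) with the difference between an AB-join against T(-) and that self-join. Everything here is an assumption for illustration: the toy series, the helper names `self_join` and `ab_join`, and the exclusion zone of `m` samples. Any extra normalization a published definition of the contrast profile may apply is omitted. In STUMPY the two joins would presumably be `stumpy.stump(T_pos, m)` and `stumpy.stump(T_pos, m, T_neg, ignore_trivial=False)`.

```python
import math


def znorm(window):
    """Z-normalize a window; a constant window maps to all zeros."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))
    return [0.0] * len(window) if std == 0.0 else [(v - mean) / std for v in window]


def windows(series, m):
    """All z-normalized windows of length m."""
    return [znorm(series[i : i + m]) for i in range(len(series) - m + 1)]


def self_join(series, m):
    """Naive matrix profile of a series against itself.

    Matches closer than m samples apart are skipped (exclusion zone is an assumption).
    """
    w = windows(series, m)
    return [
        min(math.dist(a, b) for j, b in enumerate(w) if abs(i - j) >= m)
        for i, a in enumerate(w)
    ]


def ab_join(series_a, series_b, m):
    """For each window of series_a, distance to its nearest window in series_b."""
    wa, wb = windows(series_a, m), windows(series_b, m)
    return [min(math.dist(a, b) for b in wb) for a in wa]


# Hypothetical toy data: a repeating normal beat, with a rare shape occurring twice in T(+).
m = 4
normal = [0.0, 1.0, 0.0, -1.0]
rare = [1.0, 1.0, -1.0, -1.0]
T_neg = normal * 5
T_pos = normal * 2 + rare + normal * 2 + rare + normal * 2

mp_pos = self_join(T_pos, m)  # MP(+,+)
mp_pos_neg = ab_join(T_pos, T_neg, m)  # MP(+,-)
contrast = [ab - aa for ab, aa in zip(mp_pos_neg, mp_pos)]
peak = max(range(len(contrast)), key=contrast.__getitem__)
print(f"max self-join value: {max(mp_pos):.3f}, contrast peak at index {peak}")
```

In this toy series every window of T(+) has an exact twin elsewhere in T(+), so the self-join profile is flat and neither a motif nor a discord ranking singles out the rare shape. The difference against T(-) is nonzero only for windows that overlap the rare shape.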
I think it is important to motivate "why" computing the full matrix profile is not enough and also demonstrate its ineffectiveness for this particular problem
1. It is challenging to set a goal before calculating the Matrix Profile across the whole data. When we already have a signal pattern we want to find, searching for motifs or discords with the Matrix Profile may not seem natural. Users might wonder what to do in such cases and may find it unnatural to start from the naive Matrix Profile.
2. Analyzing the result of applying the Matrix Profile to the whole data is difficult. As you mentioned, whether we can find what we're looking for depends largely on the parts of the signal other than the characteristic of interest. Therefore, it's hard to say from the results what would work better for that characteristic.
3. I want to determine which elements must be explained in the introduction and which elements are optional. Currently, the overall flow is:
   - 3-1. If a discord is included once, it can be found.
   - 3-2. If a discord is included twice, it cannot be found.
   - 3-3. So, what should we do?

   Regarding this flow, I thought it would be better to state more concisely at the beginning what the Contrast Profile brings. What do you think should be written in the introduction?