
Dataset

Open sungggat opened this issue 2 years ago • 8 comments

How can I get the CINC2021 dataset? How do I download the dataset from the URL you provided in the benchmarks? I could not find prepare_dataset.py, but I found it in the original repo.

sungggat avatar Apr 03 '23 08:04 sungggat

Just call the `download` method. Of course, you may also download the zip files from Google Cloud with some other tool and uncompress them manually. The `prepare_dataset` function in the original repo existed because I had to keep the files in specific subfolders to maintain the paths. The `_ls_rec` method was updated so that the paths are now maintained in a pandas DataFrame; moving files around in `prepare_dataset` is therefore unnecessary, and the function was removed.

wenh06 avatar Apr 04 '23 01:04 wenh06

I downloaded the CinC 2021 dataset from https://physionet.org/content/challenge-2021/#files. I want to run trainer.py from benchmarks/cinc2021. I also added ds_train and ds_val:

    TrainCfg.db_dir = 'data/CINC2021/physionet.org/files/challenge-2021/1.0.3/training/'

    ds_train = CINC2021(TrainCfg, training=True, lazy=True)
    ds_val = CINC2021(TrainCfg, training=False, lazy=True)

I am getting the error below:

    File "trainer.py", line 423, in <module>
        ds_train = CINC2021(TrainCfg, training=True, lazy=True)
    File "/workspace/torch_ecg/benchmarks/train_crnn_cinc2021/dataset.py", line 101, in __init__
        self.config.train_ratio, force_recompute=False
    File "/workspace/torch_ecg/benchmarks/train_crnn_cinc2021/dataset.py", line 306, in _train_test_split
        self.reader.all_records[t], dynamic_ncols=True, mininterval=1.0
    TypeError: len() takes no keyword arguments

sungggat avatar Apr 11 '23 04:04 sungggat

It's a typo in this file, introduced perhaps during a copy-paste (from torch_ecg/databases/datasets/cinc2021/cinc2021_dataset.py). The closing bracket of the len call was missing and had been added in the wrong place (perhaps by Copilot?), so the keyword arguments intended for tqdm were passed to len instead. It is now corrected in 20203caab4945994bfff6df7df702b1656406600.
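A one-liner reproduces the failure: keyword arguments that belong to tqdm end up inside the len call, which Python rejects.

```python
records = ["A01", "A02", "A03"]

# Buggy form: the closing bracket of len() was misplaced, so tqdm's
# keyword arguments were passed to len(), which takes no keywords.
try:
    len(records, dynamic_ncols=True, mininterval=1.0)
except TypeError as err:
    print(err)  # len() takes no keyword arguments

# Fixed form: close len() first; the kwargs then go to tqdm.
print(len(records))  # 3
```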

wenh06 avatar Apr 11 '23 14:04 wenh06

Hi, I'm trying to run trainer.py for train_hybrid_cpsc2020. I have downloaded the CPSC 2020 dataset and specified the data path inside cfg.py like this:

    BaseCfg.db_dir = 'D:/AUT/Data_Lab/Implementation/TinyML/data/TrainingSet/'

TrainingSet contains two subfolders, data and ref, each containing 10 .mat files, but I come across this error whenever I run trainer.py:

    File "C:\Users\AK\miniconda3\envs\cpsc\Lib\site-packages\torch\utils\data\dataloader.py", line 350, in __init__
        sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
    File "C:\Users\AK\miniconda3\envs\cpsc\Lib\site-packages\torch\utils\data\sampler.py", line 143, in __init__
        raise ValueError(f"num_samples should be a positive integer value, but got num_samples={self.num_samples}")
    ValueError: num_samples should be a positive integer value, but got num_samples=0

Any advice on how I can fix this?
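For context, the check that raises this error is simple: the DataLoader's RandomSampler rejects a dataset of length zero. A torch-free sketch of that validation (`check_dataset_size` is a name made up here, not torch API):

```python
def check_dataset_size(dataset) -> None:
    """Mimic RandomSampler's validation: an empty dataset is rejected
    before any DataLoader iteration can start."""
    num_samples = len(dataset)
    if num_samples <= 0:
        raise ValueError(
            f"num_samples should be a positive integer value, but got num_samples={num_samples}"
        )

# An empty record list (e.g. when the reader found no .mat files) fails:
try:
    check_dataset_size([])
except ValueError as err:
    print(err)  # num_samples should be a positive integer value, but got num_samples=0

# A non-empty dataset passes the check.
check_dataset_size(["A01", "A02"])
```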

AK-mehr avatar Mar 25 '24 15:03 AK-mehr

It seems that the data reader did not find the recording files. The CPSC2020 data reader searches for the recording and annotation files using the following method:

    def _ls_rec(self) -> None:
        """Find all records in the database directory
        and store them (path, metadata, etc.) in some private attributes.
        """
        self._df_records = pd.DataFrame()
        n_records = 10
        all_records = [f"A{i:02d}" for i in range(1, 1 + n_records)]
        self._df_records["path"] = [path for path in self.db_dir.rglob(f"*.{self.rec_ext}") if path.stem in all_records]
        self._df_records["record"] = self._df_records["path"].apply(lambda x: x.stem)
        self._df_records.set_index("record", inplace=True)

        all_annotations = [f"R{i:02d}" for i in range(1, 1 + n_records)]
        df_ann = pd.DataFrame()
        df_ann["ann_path"] = [path for path in self.db_dir.rglob(f"*.{self.ann_ext}") if path.stem in all_annotations]
        df_ann["record"] = df_ann["ann_path"].apply(lambda x: x.stem.replace("R", "A"))
        df_ann.set_index("record", inplace=True)
        # take the intersection by the index of `df_ann` and `self._df_records`
        self._df_records = self._df_records.join(df_ann, how="inner")

        if len(self._df_records) > 0:
            if self._subsample is not None:
                size = min(
                    len(self._df_records),
                    max(1, int(round(self._subsample * len(self._df_records)))),
                )
                self._df_records = self._df_records.sample(n=size, random_state=DEFAULTS.SEED, replace=False)

        self._all_records = self._df_records.index.tolist()
        self._all_annotations = self._df_records["ann_path"].apply(lambda x: x.stem).tolist()
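As an aside on the `_subsample` branch above: the size computation keeps at least one record and never more than are available. The arithmetic can be checked in isolation (`subsample_size` is a throwaway helper for illustration, not part of the reader):

```python
# Reproduce the subsample size computation from `_ls_rec`:
# at least 1 record is kept, and never more than the total available.
def subsample_size(n_records: int, subsample: float) -> int:
    return min(n_records, max(1, int(round(subsample * n_records))))

print(subsample_size(10, 0.3))   # 3
print(subsample_size(10, 0.01))  # clamped up to 1
print(subsample_size(10, 2.0))   # clamped down to 10
```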

Theoretically, you can pass any of its parent directories as db_dir, since pathlib.Path.rglob searches recursively.
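A quick illustration of that point, with throwaway empty files standing in for the CPSC2020 recordings: any parent of the data folders yields the same search results.

```python
import tempfile
from pathlib import Path

# Create a directory tree mimicking TrainingSet/{data,ref} with .mat files.
root = Path(tempfile.mkdtemp())
(root / "TrainingSet" / "data").mkdir(parents=True)
(root / "TrainingSet" / "ref").mkdir(parents=True)
for i in range(1, 11):
    (root / "TrainingSet" / "data" / f"A{i:02d}.mat").touch()
    (root / "TrainingSet" / "ref" / f"R{i:02d}.mat").touch()

# rglob searches recursively, so any parent of the data folders works as db_dir.
found_from_root = sorted(p.stem for p in root.rglob("*.mat"))
found_from_subdir = sorted(p.stem for p in (root / "TrainingSet").rglob("*.mat"))
assert found_from_root == found_from_subdir
print(len(found_from_root))  # 20
```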

wenh06 avatar Mar 26 '24 02:03 wenh06

I think I know the reason now. The CPSC2020 dataset uses sliced recordings, since the original recordings are fairly long. So you should call the persistence method first; it takes quite a long time to slice the recordings.

wenh06 avatar Mar 26 '24 02:03 wenh06

Thank you for your guidance. It seems that training requires a CNN.h5 and a CRNN.h5 file located in the signal_processing/ecg_rpeaks_dl_models directory, but I only have the corresponding JSON files. It's worth noting that I've only run trainer.py. Should I do anything before running trainer.py? Could you please help me with this one as well?

AK-mehr avatar Mar 26 '24 10:03 AK-mehr

I added automatic downloading of these models, which you can find in https://opensz.oss-cn-beijing.aliyuncs.com/ICBEB2020/file/CPSC2019-opensource.zip. However, these models were trained with a much older version of Keras, so one might have trouble loading them. I also removed the auto-loading of deep learning models in the signal_processing module.

The changes are currently in the dev branch and will be merged into the master branch soon.

wenh06 avatar Mar 26 '24 17:03 wenh06