# Data Exploration

Let's begin by exploring data in the MIMIC Waveform Database.

Our **objectives** are to:
- Review the structure of the MIMIC Waveform Database (considering subjects, studies, records, and segments).
- Load waveforms using the WFDB toolbox.
- Find out which signals are present in selected records and segments, and how long the signals last.
- Search for records that contain signals of interest.

<div class="alert alert-block alert-warning">
<p><b>Resource:</b> You can find out more about the MIMIC Waveform Database <a href="https://physionet.org/content/mimic4wdb/0.1.0/">here</a>.</p>
</div>

---
## Setup

### Specify the required Python packages
We'll import the following:
- _sys_: a module from the Python standard library
- _pathlib_ (specifically, the _Path_ class from _pathlib_, which represents file paths)

In [None]:
import sys
from pathlib import Path

### Specify a particular version of the WFDB Toolbox

- _wfdb_: For this workshop we will be using version 4 of the WaveForm DataBase (WFDB) Toolbox package. The package contains tools for processing waveform data such as those found in MIMIC:

In [None]:
!pip install wfdb==4.0.0
import wfdb

<div class="alert alert-block alert-warning">
<p><b>Resource:</b> You can find out more about the WFDB package <a href="https://physionet.org/content/wfdb-python/3.4.1/">here</a>.</p>
</div>

Now that we have imported these packages (_i.e._ toolboxes) we have a set of tools (functions) ready to use.

### Specify the name of the MIMIC Waveform Database

- Specify the name of the MIMIC-IV Waveform Database on PhysioNet, which comes from the URL: https://physionet.org/content/mimic4wdb/0.1.0/

In [None]:
database_name = 'mimic4wdb/0.1.0'

---
## Identify the records in the database

### Get a list of records

- Use the [`get_record_list`](https://wfdb.readthedocs.io/en/latest/io.html#wfdb.io.get_record_list) function from the WFDB toolbox to get a list of records in the database.

In [None]:
# each subject may be associated with multiple records
subjects = wfdb.get_record_list(database_name)
print(f"The '{database_name}' database contains data from {len(subjects)} subjects")

# set max number of records to load
max_records_to_load = 200

The 'mimic4wdb/0.1.0' database contains data from 198 subjects


In [None]:
# iterate the subjects to get a list of records
records = []
for subject in subjects:
    studies = wfdb.get_record_list(f'{database_name}/{subject}')
    for study in studies:
        records.append(Path(f'{subject}{study}'))
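        # stop if we've loaded enough records
        if len(records) >= max_records_to_load:
            break
    # the inner break only leaves the study loop, so exit the subject loop too
    if len(records) >= max_records_to_load:
        print("Reached maximum required number of records.")
        break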

print(f"Loaded {len(records)} records from the '{database_name}' database.")

Reached maximum required number of records.
Loaded 200 records from the 'mimic4wdb/0.1.0' database.
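
Stopping a nested loop after a fixed number of items can also be written with a generator and `itertools.islice`, which removes the need to break out of each loop level. The following is a minimal sketch that runs offline: `toy_listing`, `iter_records`, `toy_limit` and `toy_records` are hypothetical names, and the dictionary only stands in for the listings that `wfdb.get_record_list` would return (the split between subject and study strings is assumed from the record paths shown in this tutorial).

```python
from itertools import islice
from pathlib import Path

# Hypothetical stand-in for the PhysioNet listings, so the sketch runs offline.
# Keys mimic subject directories; values mimic the studies listed for each subject.
toy_listing = {
    'waves/p100/p10014354/': ['81739927/81739927'],
    'waves/p100/p10019003/': ['87033314/87033314'],
    'waves/p100/p10039708/': ['83411188/83411188', '85583557/85583557'],
}


def iter_records(listing):
    """Yield one Path per record, subject by subject."""
    for subject, studies in listing.items():
        for study in studies:
            yield Path(f'{subject}{study}')


# islice stops consuming the generator once the limit is reached
toy_limit = 3
toy_records = list(islice(iter_records(toy_listing), toy_limit))
print(f"Loaded {len(toy_records)} records")
```

Because the generator is lazy, no further listings would be requested once the limit has been reached.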


### Look at the records

- Display the first few records

In [None]:
# format and print first five records
first_five_records = [str(x) for x in records[0:5]]
first_five_records = "\n - ".join(first_five_records)
print(f"First five records: \n - {first_five_records}")

print("""
Note the formatting of these records:
 - intermediate directory ('p100' in this case)
 - subject identifier (e.g. 'p10014354')
 - record identifier (e.g. '81739927')
 """)

First five records: 
 - waves/p100/p10014354/81739927/81739927
 - waves/p100/p10019003/87033314/87033314
 - waves/p100/p10020306/83404654/83404654
 - waves/p100/p10039708/83411188/83411188
 - waves/p100/p10039708/85583557/85583557

Note the formatting of these records:
 - intermediate directory ('p100' in this case)
 - subject identifier (e.g. 'p10014354')
 - record identifier (e.g. '81739927')
 


<div class="alert alert-block alert-info">
<p><b>Q:</b> Can you print the names of the last five records? <br> <b>Hint:</b> in Python, the last five elements can be specified using '[-5:]'</p>
</div>
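
The components of a record path can be unpacked with `pathlib`. This is a small sketch using one of the record paths listed above; the variable names (`example_record`, `top_dir`, `study_dir` and so on) are chosen here for illustration only.

```python
from pathlib import Path

# One of the record paths printed above
example_record = Path('waves/p100/p10014354/81739927/81739927')

# parts is ('waves', 'p100', 'p10014354', '81739927', '81739927')
top_dir, intermediate_dir, subject_id, study_dir, record_id = example_record.parts

print(f"Intermediate directory: {intermediate_dir}")
print(f"Subject identifier:     {subject_id}")
print(f"Record identifier:      {record_id}")

# The record's directory and name are also available directly
print(f"Directory: {example_record.parent.as_posix()}")
print(f"Name:      {example_record.name}")
```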

---
## Extract metadata for a record

Each record contains metadata stored in a header file, named "`<record name>.hea`".

### Specify the online directory containing a record's data

In [None]:
# Specify the 4th record (note, in Python indexing begins at 0)
idx = 3
record = records[idx]
record_dir = f'{database_name}/{record.parent}'
print("PhysioNet directory specified for record: {}".format(record_dir))

PhysioNet directory specified for record: mimic4wdb/0.1.0/waves/p100/p10039708/83411188


### Specify the record name

Extract the record name (e.g. '83411188') from the record path (e.g. 'waves/p100/p10039708/83411188/83411188'):

In [None]:
record_name = record.name
print("Record name: {}".format(record_name))

Record name: 83411188


### Load the metadata for this record
- Use the [`rdheader`](https://wfdb.readthedocs.io/en/latest/io.html#wfdb.io.rdheader) function from the WFDB toolbox to load metadata from the record header file

In [None]:
record_data = wfdb.rdheader(record_name, pn_dir=record_dir, rd_segments=True)
remote_url = "https://physionet.org/content/" + record_dir + "/" + record_name + ".hea"
print(f"Done: metadata loaded for record '{record_name}' from the header file at:\n{remote_url}")

Done: metadata loaded for record '83411188' from the header file at:
https://physionet.org/content/mimic4wdb/0.1.0/waves/p100/p10039708/83411188/83411188.hea


---
## Inspect details of physiological signals recorded in this record
- Print a few details of the signals from the extracted metadata

In [None]:
print(f"- Number of signals: {record_data.n_sig}")
print(f"- Duration: {record_data.sig_len/(record_data.fs*60*60):.1f} hours")
print(f"- Base sampling frequency: {record_data.fs} Hz")

- Number of signals: 6
- Duration: 14.2 hours
- Base sampling frequency: 62.4725 Hz
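
The duration is obtained by dividing the number of samples (`sig_len`) by the sampling frequency (`fs`, in Hz) to get seconds, then dividing by 3600 to get hours. The sketch below wraps that arithmetic in a hypothetical helper, `duration_hours`; the sample count is an assumed value chosen only to illustrate the calculation and is not read from the header.

```python
def duration_hours(sig_len, fs):
    """Duration in hours of a signal with sig_len samples at fs Hz."""
    return sig_len / (fs * 60 * 60)


example_fs = 62.4725         # Hz, the base sampling frequency reported above
example_sig_len = 3_200_000  # samples (assumed for illustration)

print(f"- Duration: {duration_hours(example_sig_len, example_fs):.1f} hours")
```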


---
## Inspect the segments making up a record
Each record is typically made up of several segments:

In [None]:
segments = record_data.seg_name
print(f"The {len(segments)} segments from record {record_name} are:\n{segments}")

The 6 segments from record 83411188 are:
['83411188_0000', '83411188_0001', '83411188_0002', '83411188_0003', '83411188_0004', '83411188_0005']


The name of each segment has the format: `record name, "_", segment number`
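
As an illustration of this naming pattern, the sketch below builds and splits segment names. The helper names `segment_name` and `split_segment_name` are hypothetical, and the four-digit zero padding is inferred from the segment names printed above rather than taken from the database documentation.

```python
def segment_name(record_name, segment_number):
    """Build a segment name such as '83411188_0002'."""
    return f"{record_name}_{segment_number:04d}"


def split_segment_name(name):
    """Split a segment name into (record name, segment number)."""
    record_part, _, number = name.rpartition('_')
    return record_part, int(number)


print(segment_name('83411188', 2))
print(split_segment_name('83411188_0005'))
```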

---
## Inspect an individual segment
### Read the metadata for this segment
- Read the metadata from the header file

In [None]:
# read the header of the second listed segment ('83411188_0001')
segment_metadata = wfdb.rdheader(record_name=segments[1], pn_dir=record_dir)

print(f"""Header metadata loaded for: 
- the segment '{segments[1]}'
- in record '{record_name}'
- for subject '{Path(record_dir).parent.name}'
""")

Header metadata loaded for: 
- the segment '83411188_0001'
- in record '83411188'
- for subject 'p10039708'



### Find out what signals are present

In [None]:
print(f"This segment contains the following signals: {segment_metadata.sig_name}")
print(f"The signals are measured in units of: {segment_metadata.units}")

This segment contains the following signals: ['II', 'V', 'aVR', 'ABP', 'Pleth', 'Resp']
The signals are measured in units of: ['mV', 'mV', 'mV', 'mmHg', 'NU', 'Ohm']


See [here](https://archive.physionet.org/mimic2/mimic2_waveform_overview.shtml#signals-125-samplessecond) for definitions of signal abbreviations.

<div class="alert alert-block alert-info">
<p><b>Q:</b> Which of these signals is no longer present in segment '83411188_0005'?</p>
</div>

### Find out how long each signal lasts

All signals in a segment are time-aligned, measured at the same sampling frequency, and last the same duration:

In [None]:
print(f"The signals have a base sampling frequency of {segment_metadata.fs:.1f} Hz")
print(f"and they last for {segment_metadata.sig_len/(segment_metadata.fs*60):.1f} minutes")

The signals have a base sampling frequency of 62.5 Hz
and they last for 0.9 minutes


## Identify records suitable for analysis

- The signals and their durations vary from one record (and segment) to the next. 
- Since most studies require specific types of signals (e.g. blood pressure and photoplethysmography signals), we need to be able to identify which records (or segments) contain the required signals and duration.

### Setup

In [None]:
import pandas as pd
from pprint import pprint

In [None]:
print(f"Earlier, we loaded {len(records)} records from the '{database_name}' database.")

Earlier, we loaded 200 records from the 'mimic4wdb/0.1.0' database.


### Specify requirements

- Required signals

In [None]:
required_sigs = ['ABP', 'Pleth']

- Required duration

In [None]:
# required duration: 10 minutes, expressed in seconds
req_seg_duration = 10*60
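
Taken together, the two requirements amount to a simple check on a segment's header fields. The following is a minimal sketch of that check, using a hypothetical helper `segment_meets_requirements`. The signal names are those printed for the segment inspected earlier, while the sample counts are made up for illustration.

```python
def segment_meets_requirements(sig_names, sig_len, fs, required_sigs, min_duration_s):
    """Return True if a segment has every required signal and lasts long enough."""
    duration_s = sig_len / fs
    has_signals = all(sig in sig_names for sig in required_sigs)
    return has_signals and duration_s >= min_duration_s


# Signal names as printed for the segment inspected earlier
example_sigs = ['II', 'V', 'aVR', 'ABP', 'Pleth', 'Resp']
example_fs = 62.4725  # Hz

# Hypothetical sample counts: roughly 16 minutes and roughly 48 seconds
long_enough = segment_meets_requirements(example_sigs, 60_000, example_fs,
                                         ['ABP', 'Pleth'], 10 * 60)
too_short = segment_meets_requirements(example_sigs, 3_000, example_fs,
                                       ['ABP', 'Pleth'], 10 * 60)
print(long_enough, too_short)
```

Keeping the check in one function makes it easy to change the required signals or duration without touching the search loop.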

### Find out how many records meet the requirements

_NB: This step may take a while. The results are copied below to save running it yourself._

In [None]:
matching_recs = {'dir':[], 'seg_name':[], 'length':[]}

for record in records:
    print('Record: {}'.format(record), end="", flush=True)
    record_dir = f'{database_name}/{record.parent}'
    record_name = record.name
    print(' (reading data)')
    record_data = wfdb.rdheader(record_name,
                                pn_dir=record_dir,
                                rd_segments=True)

    # Check whether the required signals are present in the record
    sigs_present = record_data.sig_name
    if not all(x in sigs_present for x in required_sigs):
        print('   (missing signals)')
        continue

    # Get the segments for the record
    segments = record_data.seg_name

    # Check whether each segment lasts at least the required duration
    # If not, move to the next one ('~' entries mark gaps and are skipped)
    gen = (segment for segment in segments if segment != '~')
    for segment in gen:
        print(' - Segment: {}'.format(segment), end="", flush=True)
        segment_metadata = wfdb.rdheader(record_name=segment,
                                         pn_dir=record_dir)
        seg_length = segment_metadata.sig_len/(segment_metadata.fs)

        if seg_length < req_seg_duration:
            print(f' (too short at {seg_length/60:.1f} mins)')
            continue

        # Next check that all required signals are present in the segment
        sigs_present = segment_metadata.sig_name
        
        if all(x in sigs_present for x in required_sigs):
            matching_recs['dir'].append(record_dir)
            matching_recs['seg_name'].append(segment)
            matching_recs['length'].append(seg_length)
            print(' (met requirements)')
            # Since we only need one segment per record break out of loop
            break
        else:
            print(' (long enough, but missing signal(s))')

print(f"A total of {len(matching_recs['dir'])} records met the requirements.")

# Optionally, save the results for later use:
#df_matching_recs = pd.DataFrame(data=matching_recs)
#df_matching_recs.to_csv('matching_records.csv', index=False)

In [None]:
print(f"A total of {len(matching_recs['dir'])} out of {len(records)} records met the requirements.")

relevant_segments_names = "\n - ".join(matching_recs['seg_name'])
print(f"\nThe relevant segment names are:\n - {relevant_segments_names}")

relevant_dirs = "\n - ".join(matching_recs['dir'])
print(f"\nThe corresponding directories are: \n - {relevant_dirs}")

A total of 52 out of 200 records met the requirements.

The relevant segment names are:
 - 83404654_0005
 - 82924339_0007
 - 84248019_0005
 - 82439920_0004
 - 82800131_0002
 - 84304393_0001
 - 89464742_0001
 - 88958796_0004
 - 88995377_0001
 - 85230771_0004
 - 86643930_0004
 - 81250824_0005
 - 87706224_0003
 - 83058614_0005
 - 82803505_0017
 - 88574629_0001
 - 87867111_0012
 - 84560969_0001
 - 87562386_0001
 - 88685937_0001
 - 86120311_0001
 - 89866183_0014
 - 89068160_0002
 - 86380383_0001
 - 85078610_0008
 - 87702634_0007
 - 84686667_0002
 - 84802706_0002
 - 81811182_0004
 - 84421559_0005
 - 88221516_0007
 - 80057524_0005
 - 84209926_0018
 - 83959636_0010
 - 89989722_0016
 - 89225487_0007
 - 84391267_0001
 - 80889556_0002
 - 85250558_0011
 - 84567505_0005
 - 85814172_0007
 - 88884866_0005
 - 80497954_0012
 - 80666640_0014
 - 84939605_0004
 - 82141753_0018
 - 86874920_0014
 - 84505262_0010
 - 86288257_0001
 - 89699401_0001
 - 88537698_0013
 - 83958172_0001

The corresponding directories are: [output truncated]

<div class="alert alert-block alert-info">
<p><b>Question:</b> Is this enough data for a study? Consider different types of studies, e.g. assessing the performance of a previously proposed algorithm to estimate BP from the PPG signal, vs. developing a deep learning approach to estimate BP from the PPG.</p>
</div>
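
One way to inform that judgement is to summarise how much signal the matching segments contain. The sketch below assumes a dictionary laid out like `matching_recs`, with lengths in seconds. The dictionary `toy_matching_recs` and all of its values are made up for illustration and are not results from the database.

```python
from statistics import median

# Hypothetical results in the same layout as `matching_recs`;
# the names and lengths (in seconds) are made up for illustration.
toy_matching_recs = {
    'dir': ['dir_a', 'dir_b', 'dir_c'],
    'seg_name': ['rec_a_0001', 'rec_b_0004', 'rec_c_0002'],
    'length': [900.0, 3600.0, 1800.0],
}

# Convert each length to minutes and the total to hours
lengths_min = [length / 60 for length in toy_matching_recs['length']]
total_hours = sum(toy_matching_recs['length']) / 3600

print(f"Segments: {len(lengths_min)}")
print(f"Shortest: {min(lengths_min):.1f} min, median: {median(lengths_min):.1f} min")
print(f"Total duration: {total_hours:.2f} hours")
```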