How to split a large dataset into smaller datasets with a set number of columns using h5py?

I receive h5 files with large datasets (several thousand columns). I use h5py to export these datasets to CSV, which I then have to cut into sub-matrices of at most 500 columns to be able to analyze them.

How can I proceed (with h5py?) to export the data directly into several CSV files, each with a defined number of columns?

Currently I cut these large CSV files with a C# program, but this is time-consuming and adds unnecessary extra work.

  • h5py and CSV are not the same thing, but in either case splitting by column is best done in RAM on the whole array.


  • Have you already tried a simple solution such as array.tofile(file, sep=',') or pandas.DataFrame.to_csv? (See the sketch after these comments.)


  • Why do you convert the HDF5 files to CSV? As you said, it is “time-consuming and adds unnecessary extra work.” Most importantly, converting creates duplicate copies of the data that you have to track. There is no need to do that. Reading slices of HDF5 data with h5py is simple (either by field/column name or by row indices).


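Following up on the tofile / to_csv suggestion in the comments, here is a minimal sketch of the in-RAM approach. The array arr, the chunk size, and the output file names are placeholders, not from the original post: split the array by column and write each block of at most 500 columns with pandas.DataFrame.to_csv.

import numpy as np
import pandas as pd

chunk = 500                              # max columns per output CSV
arr = np.random.random((100, 1200))      # placeholder for the full array already in RAM

for i, start in enumerate(range(0, arr.shape[1], chunk)):
    # take a block of at most `chunk` columns and let pandas write the CSV
    pd.DataFrame(arr[:, start:start + chunk]).to_csv(
        f'part_{i:03d}.csv', index=False, header=False)
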
Here is a simple example that creates an H5 file with field/column names, then reads slices (both by field name and row indices).

import h5py
import numpy as np

n_fields, n_rows = 10, 100

name_list = [f'Field_{i:02}' for i in range(1, n_fields+1)]
format_list = ['float' for _ in range(n_fields)]

ds_dt = np.dtype({'names': name_list, 'formats': format_list})

# Create a dataset with a compound dtype and fill each field with random values
with h5py.File('SO_77378785.h5', 'w') as h5f:
    ds = h5f.create_dataset('test', shape=(n_rows,), dtype=ds_dt)
    for i in range(n_fields):
        ds[name_list[i]] = np.random.random(n_rows)

# Open file and read slices
with h5py.File('SO_77378785.h5') as h5f:
    ds = h5f['test']                          # dataset object; no data read yet
    field_1_slice = ds['Field_01']            # read one field/column by name
    print(field_1_slice.shape)
    field_2_slice = h5f['test']['Field_01']   # same read without the intermediate object
    print(field_2_slice.shape)
    row_slice = h5f['test']['Field_10'][:50]  # first 50 rows of one field
    print(row_slice.shape)

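Coming back to the original question, the same slicing approach can write one CSV per group of columns without ever creating the full CSV. Below is a minimal sketch, assuming the data is stored as a plain 2D dataset; the names large_file.h5 and data are placeholders. For a compound dataset like the one above, loop over groups of 500 field names instead.

import h5py
import numpy as np

max_cols = 500   # maximum number of columns per output CSV

with h5py.File('large_file.h5') as h5f:   # placeholder file name
    ds = h5f['data']                      # placeholder 2D dataset, shape (n_rows, n_cols)
    n_cols = ds.shape[1]
    for start in range(0, n_cols, max_cols):
        stop = min(start + max_cols, n_cols)
        block = ds[:, start:stop]         # reads only these columns from disk
        np.savetxt(f'cols_{start}_{stop - 1}.csv', block, delimiter=',', fmt='%.6g')

Each pass reads only the columns it needs, so memory use stays bounded by one block rather than the whole dataset.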