How to split a large dataset into smaller datasets with a set number of columns using h5py?

I receive h5 files with large datasets (several thousand columns). I use h5py to export these datasets to CSV, which I then have to cut into sub-matrices of at most 500 columns to be able to analyze them.

How can I proceed (with h5py?) to export the data directly into several CSV files, each with a defined number of columns?

Currently I cut these large CSV files with a C# program, but this is time-consuming and adds unnecessary extra work.

  • h5py and CSV are not the same thing, but in either case splitting by column is best done in RAM on the whole array.


  • Have you already tried a simple solution such as array.tofile(file, sep=',') or pandas.DataFrame.to_csv? (See the sketch after these comments.)


  • Why do you convert the HDF5 files to CSV? As you said, it is “time-consuming and adds unnecessary extra work.” Most importantly, converting creates duplicate copies of the data that you have to track. There is no need to do that. Reading slices of HDF5 data with h5py is simple (either by field/column name or by row indices).


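Following up on the tofile / to_csv suggestion in the comments, here is a minimal sketch of the in-RAM approach. The array arr, the chunk size, and the output file names are placeholders, not from the original post: split the array by column and write each block of at most 500 columns with pandas.DataFrame.to_csv.

import numpy as np
import pandas as pd

chunk = 500                              # max columns per output CSV
arr = np.random.random((100, 1200))      # placeholder for the full array already in RAM

for i, start in enumerate(range(0, arr.shape[1], chunk)):
    # take a block of at most `chunk` columns and let pandas write the CSV
    pd.DataFrame(arr[:, start:start + chunk]).to_csv(
        f'part_{i:03d}.csv', index=False, header=False)
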
Here is a simple example that creates an H5 file with field/column names, then reads slices (both by field name and row indices).

import h5py
import numpy as np

n_fields, n_rows = 10, 100

name_list = [f'Field_{i:02}' for i in range(1, n_fields+1)]
format_list = ['float' for _ in range(n_fields)]

ds_dt = np.dtype({'names': name_list, 'formats': format_list})

# Create a dataset with a compound dtype and fill each field with random values
with h5py.File('SO_77378785.h5', 'w') as h5f:
    ds = h5f.create_dataset('test', shape=(n_rows,), dtype=ds_dt)
    for i in range(n_fields):
        ds[name_list[i]] = np.random.random(n_rows)

# Open file and read slices
with h5py.File('SO_77378785.h5') as h5f:
    ds = h5f['test']                          # dataset object; no data read yet
    field_1_slice = ds['Field_01']            # read one field/column by name
    print(field_1_slice.shape)
    field_2_slice = h5f['test']['Field_01']   # same read without the intermediate object
    print(field_2_slice.shape)
    row_slice = h5f['test']['Field_10'][:50]  # first 50 rows of one field
    print(row_slice.shape)

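Coming back to the original question, the same slicing approach can write one CSV per group of columns without ever creating the full CSV. Below is a minimal sketch, assuming the data is stored as a plain 2D dataset; the names large_file.h5 and data are placeholders. For a compound dataset like the one above, loop over groups of 500 field names instead.

import h5py
import numpy as np

max_cols = 500   # maximum number of columns per output CSV

with h5py.File('large_file.h5') as h5f:   # placeholder file name
    ds = h5f['data']                      # placeholder 2D dataset, shape (n_rows, n_cols)
    n_cols = ds.shape[1]
    for start in range(0, n_cols, max_cols):
        stop = min(start + max_cols, n_cols)
        block = ds[:, start:stop]         # reads only these columns from disk
        np.savetxt(f'cols_{start}_{stop - 1}.csv', block, delimiter=',', fmt='%.6g')

Each pass reads only the columns it needs, so memory use stays bounded by one block rather than the whole dataset.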