I receive h5 files containing large datasets (several thousand columns). I use h5py to export these datasets to CSV, which I then have to cut into sub-matrices of at most 500 columns to be able to analyze them.
How can I proceed (with h5py?) to export the data directly into several CSV files, each with a defined maximum number of columns?
Currently I cut these large CSV files with a C# program, but this is time-consuming and adds unnecessary extra work.
Here is a simple example that creates an H5 file with field/column names, then reads slices (both by field name and by row indices).
import h5py
import numpy as np

n_fields, n_rows = 10, 100
name_list = [f'Field_{i:02}' for i in range(1, n_fields + 1)]
format_list = ['float' for _ in range(n_fields)]
ds_dt = np.dtype({'names': name_list, 'formats': format_list})

# Create the file and fill each field/column with random data
with h5py.File('SO_77378785.h5', 'w') as h5f:
    ds = h5f.create_dataset('test', shape=(n_rows,), dtype=ds_dt)
    for i in range(n_fields):
        ds[name_list[i]] = np.random.random(n_rows)

# Open file and read slices
with h5py.File('SO_77378785.h5') as h5f:
    ds = h5f['test']                          # creates dataset object for reference
    field_1_slice = ds['Field_01']            # reads a field slice from the dataset object
    print(field_1_slice.shape)
    field_2_slice = h5f['test']['Field_01']   # another way to slice a field
    print(field_2_slice.shape)
    row_slice = h5f['test']['Field_10'][:50]  # first 50 rows of one field
    print(row_slice.shape)
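Building on the example above, here is a minimal sketch of the export itself: group the field names into chunks of at most cols_per_file columns, read each group, and write one CSV per group with np.savetxt. The chunk size and output file names here are assumptions for illustration (use 500 and your own naming scheme for the real data).

import h5py
import numpy as np

cols_per_file = 4  # assumed chunk size for this small example; use 500 for your data

with h5py.File('SO_77378785.h5') as h5f:
    ds = h5f['test']
    names = list(ds.dtype.names)  # field/column names of the compound dtype
    for start in range(0, len(names), cols_per_file):
        chunk = names[start:start + cols_per_file]
        # read only these fields and stack them as columns of a 2-D array
        data = np.column_stack([ds[name] for name in chunk])
        fname = f'SO_77378785_part{start // cols_per_file:03}.csv'  # hypothetical naming
        np.savetxt(fname, data, delimiter=',', header=','.join(chunk), comments='')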
h5py and CSV are not the same thing, but in either case, splitting by column is best done in RAM on the whole array.
Have you already tried a simple solution such as array.tofile(file, sep=',') or pandas.DataFrame.to_csv?
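For the pandas route, a hedged sketch (file and dataset names reused from the example above, chunk size assumed, not taken from your real files): load the structured dataset into a DataFrame, then write column slices with to_csv.

import h5py
import pandas as pd

cols_per_file = 4  # assumed; 500 for the real data

with h5py.File('SO_77378785.h5') as h5f:
    df = pd.DataFrame(h5f['test'][:])  # structured array -> DataFrame, one column per field

for i in range(0, df.shape[1], cols_per_file):
    df.iloc[:, i:i + cols_per_file].to_csv(f'part_{i // cols_per_file:03}.csv', index=False)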
Why do you convert the HDF5 files to CSV at all? As you said, it “is time consuming and adds unnecessary extra work.” Most importantly, converting creates duplicate copies of the data that you have to track. There is no need to do that: reading slices of HDF5 data with h5py is simple (either by field/column name or by row indices).