parquet_dataconverter
DataConverter for the Parquet backend.
- class graphnet.data.parquet.parquet_dataconverter.ParquetDataConverter(extractors, outdir, gcd_rescue, *, nb_files_to_batch, sequential_batch_pattern, input_file_batch_pattern, workers, index_column, icetray_verbose)
Bases: DataConverter
Class for converting I3-files to Parquet format.
Construct DataConverter.
When using input_file_batch_pattern, regular expressions are used to group files according to their names. All files that match a certain pattern up to wildcards are grouped into the same output file. This output file has the same name as the input files that are grouped into it, with wildcards replaced with “x”. Periods (.) and wildcards (*) have a special meaning: periods are interpreted as literal periods, not as matching any character (as in standard regex), and wildcards are interpreted as “.*” in standard regex.
For instance, the pattern “[A-Z]{1}_[0-9]{5}*.i3.zst” will find all I3 files whose names contain:
one capital letter, followed by
an underscore, followed by
five digits, followed by
any string of characters ending in “.i3.zst”
- This means that, e.g., the files:
upgrade_genie_step4_141020_A_000000.i3.zst
upgrade_genie_step4_141020_A_000001.i3.zst
…
upgrade_genie_step4_141020_A_000008.i3.zst
upgrade_genie_step4_141020_A_000009.i3.zst
would be grouped into the output file named “upgrade_genie_step4_141020_A_00000x.<suffix>” but the file
upgrade_genie_step4_141020_A_000010.i3.zst
would end up in a separate group, named “upgrade_genie_step4_141020_A_00001x.<suffix>”.
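The grouping rule described above can be illustrated with a small, self-contained sketch. This is not the library's implementation: the helper names `pattern_to_regex` and `group_files` are hypothetical, the sketch assumes the pattern contains no capturing groups of its own, and it keeps the input extension in the group name instead of swapping in the output suffix.

```python
import re
from collections import defaultdict
from typing import Dict, List


def pattern_to_regex(pattern: str) -> str:
    """Translate the simplified pattern syntax into a standard regex.

    Hypothetical helper: periods become literal periods and each
    wildcard becomes a capturing ".*" group.
    """
    return pattern.replace(".", r"\.").replace("*", "(.*)")


def group_files(files: List[str], pattern: str) -> Dict[str, List[str]]:
    """Group file names that match `pattern` up to their wildcard parts.

    Hypothetical helper: the group name is the file name with every
    wildcard-matched span replaced by "x".
    """
    regex = re.compile(pattern_to_regex(pattern))
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in files:
        match = regex.search(name)
        if match is None:
            continue  # Files that do not match the pattern are skipped here.
        key = name
        # Replace spans from right to left so earlier indices stay valid.
        for i in range(regex.groups, 0, -1):
            start, end = match.span(i)
            key = key[:start] + "x" + key[end:]
        groups[key].append(name)
    return dict(groups)


files = [f"upgrade_genie_step4_141020_A_{i:06d}.i3.zst" for i in range(11)]
groups = group_files(files, "[A-Z]{1}_[0-9]{5}*.i3.zst")
for group_name, members in groups.items():
    print(group_name, len(members))
```

With the eleven example file names, the first ten fall into one group and the file ending in 000010 forms a group of its own, matching the behaviour described in the text.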
- Parameters:
extractors (List[I3Extractor]) –
outdir (str) –
gcd_rescue (str | None) –
nb_files_to_batch (int | None) –
sequential_batch_pattern (str | None) –
input_file_batch_pattern (str | None) –
workers (int) –
index_column (str) –
icetray_verbose (int) –
- file_suffix: str = 'parquet'
- save_data(data, output_file)
Save data to parquet file.
- Return type:
None
- Parameters:
data (List[OrderedDict]) –
output_file (str) –
- merge_files(output_file, input_files)
Parquet-specific method for merging output files.
- Parameters:
output_file (str) – Name of the output file containing the merged results.
input_files (Optional[List[str]], default: None) – Intermediate files to be merged, according to the specific implementation. Defaults to None, meaning that all files output by the current instance are merged.
- Raises:
NotImplementedError – If the method has not been implemented for the Parquet backend.
- Return type:
None