fix(users/flokli/archaeology): don't use file but column compression
ClickHouse also has column compression, configurable with the
output_format_parquet_compression_method setting. It defaults to lz4, so
the previous setting produced a zstd-compressed Parquet file containing
lz4-compressed data.

Set output_format_parquet_compression_method to zstd instead, and sort
by timestamp before assembling the Parquet file.

The existing files were updated to the same format with the following
query:

```
SELECT *
FROM file('bucket_logs_2023-11-11*.pq', 'Parquet', 'auto')
ORDER BY timestamp ASC
INTO OUTFILE 'bucket_logs_2023-11-11.parquet'
SETTINGS output_format_parquet_compression_method = 'zstd'
```

Change-Id: Id63b14c82e7bf4b9907a500528b569a51e277751
Reviewed-on: https://cl.tvl.fyi/c/depot/+/10008
Reviewed-by: raitobezarius <tvl@lahfa.xyz>
Tested-by: BuildkiteCI
parent 281cb93ba8
commit 46964f6d8f
1 changed files with 5 additions and 2 deletions
@@ -29,8 +29,11 @@ fn main() -> ExitCode {
 'Regexp',
 'owner String , bucket String, timestamp_str String, remote_ip String, requester LowCardinality(String), request_id String, operation LowCardinality(String), key String, request_uri String, http_status String, error_code String, bytes_sent_str String, object_size_str String, total_time String, turn_around_time String, referer String, user_agent String, version_id String, host_id String, signature_version String, cipher_suite String, authentication_type String, host_header String, tls_version String, access_point_arn String, acl_required String'
 )
-SETTINGS format_regexp = '(\\S+) (\\S+) \\[(.*)\\] (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) ((?:\\S+ \\S+ \\S+)|\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+).*'
-INTO OUTFILE '{}' COMPRESSION 'zstd' FORMAT Parquet"#, input_files, output_file));
+ORDER BY timestamp ASC
+SETTINGS
+format_regexp = '(\\S+) (\\S+) \\[(.*)\\] (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) ((?:\\S+ \\S+ \\S+)|\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+).*',
+output_format_parquet_compression_method = 'zstd'
+INTO OUTFILE '{}' FORMAT Parquet"#, input_files, output_file));

 cmd.status().expect("clickhouse-local failed");

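To illustrate the distinction this change relies on, here is a minimal sketch of a query builder that requests zstd at the Parquet column level instead of wrapping the whole output file with `COMPRESSION 'zstd'`. The function name `build_parquet_export_query` and its parameters are hypothetical and not part of the actual tool; the clause order mirrors the new query in the diff above.

```rust
/// Build a clickhouse-local query that writes a Parquet file whose columns
/// are zstd-compressed. (Hypothetical helper, for illustration only.)
///
/// `source_expr` is the table expression to read from, for example a
/// `file(...)` table function call. `output_file` is the Parquet path.
fn build_parquet_export_query(source_expr: &str, output_file: &str) -> String {
    // No `COMPRESSION 'zstd'` after INTO OUTFILE: that clause compresses the
    // whole output file, on top of whatever codec the Parquet writer used.
    // The codec for the column data is chosen by the setting below.
    format!(
        "SELECT * FROM {source_expr} \
         ORDER BY timestamp ASC \
         SETTINGS output_format_parquet_compression_method = 'zstd' \
         INTO OUTFILE '{output_file}' FORMAT Parquet"
    )
}
```

Calling `build_parquet_export_query("file('bucket_logs_2023-11-11*.pq', 'Parquet', 'auto')", "bucket_logs_2023-11-11.parquet")` would yield a query of the same kind as the one used to rewrite the existing files, which could then be passed to clickhouse-local.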