Issue parsing UTF-8 logs lines #4863
Comments
@pierreact Thanks for your report!
@pierreact In Fluentd, specifying a regular expression that contains non-ASCII characters in the parser can cause problems. We will investigate the correct use of regular expressions containing non-ASCII characters in the parser.
def test_match(target_data, regexp_string)
  puts "=== test_match ==="
  puts "target_data: #{target_data} (#{target_data.encoding})"
  puts "regexp_string: #{regexp_string} (#{regexp_string.encoding})"
  # Raises Encoding::CompatibilityError when the regexp and the target
  # string have incompatible encodings.
  result = /#{regexp_string}/.match(target_data)
  puts "Success!"
  p result
rescue => e
  puts "Failed!"
  p e
end
test_match(
  "あいうえお".force_encoding(Encoding::UTF_8),
  "う".force_encoding(Encoding::UTF_8)
)
test_match(
  "あいうえお".force_encoding(Encoding::ASCII_8BIT),
  "う".force_encoding(Encoding::UTF_8)
)
test_match(
  "あいうえお".force_encoding(Encoding::UTF_8),
  "う".force_encoding(Encoding::ASCII_8BIT)
)
test_match(
  "あいうえお".force_encoding(Encoding::ASCII_8BIT),
  "う".force_encoding(Encoding::ASCII_8BIT)
)
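As a rough sketch of one possible workaround in plain Ruby (not a statement about how Fluentd handles this internally), a string that is tagged ASCII-8BIT but whose bytes are valid UTF-8 can be re-tagged before matching. The helper name `match_as_utf8` is hypothetical and only for illustration.

```ruby
# Hypothetical helper (not part of Fluentd): match a UTF-8 pattern against
# text that may be tagged ASCII-8BIT although its bytes are valid UTF-8.
def match_as_utf8(target_data, regexp_string)
  # dup so the caller's strings keep their original encoding tags
  text = target_data.dup.force_encoding(Encoding::UTF_8)
  pattern = regexp_string.dup.force_encoding(Encoding::UTF_8)
  # Return nil instead of raising if the bytes are not valid UTF-8
  return nil unless text.valid_encoding? && pattern.valid_encoding?

  Regexp.new(pattern).match(text)
end

binary_line = "あいうえお".b # same bytes, tagged ASCII-8BIT
result = match_as_utf8(binary_line, "う")
p result
```

`force_encoding` only changes the encoding tag and leaves the bytes untouched, which is why the `valid_encoding?` check is there: re-tagging arbitrary binary data as UTF-8 would otherwise fail later, at match time.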
Describe the bug
When parsing with a regex, I couldn't match the Unicode plus-minus character (±).
I made sure that my config file, my log file and my parser settings (encoding UTF-8) all use UTF-8.
All of these failed to match: ±, \uc2b1, \xc2\xb1
In despair, I'm matching "unknown character" for now (\uFFFD).
In the Fluentd Docker image, I enabled -vv to get more insight and discovered the following:
With fluent-package:
Shall we have something like -Eutf-8:utf-8 defined instead, or maybe an easy way to override it?
Thank you.
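For readers unfamiliar with the flag: Ruby's `-Eex:in` option sets the default external and internal encodings, and the external encoding decides how bytes read from a file are tagged. The sketch below imitates that effect in plain Ruby by passing `encoding:` explicitly to `File.read`. The sample line and temp file name are made up for illustration, and nothing here is taken from Fluentd's own code.

```ruby
require "tempfile"

# Write a sample line containing U+00B1 (plus-minus) as UTF-8 bytes.
file = Tempfile.new("utf8-sample")
file.binmode
file.write("1739970007.466780\u00B1INFO\n")
file.close

# Same bytes on disk, different encoding tags depending on the
# external encoding used for reading.
as_utf8   = File.read(file.path, encoding: "UTF-8")
as_binary = File.read(file.path, encoding: "ASCII-8BIT")

p as_utf8.encoding
p as_binary.encoding
p as_utf8.bytes == as_binary.bytes

# A UTF-8 regexp matches the UTF-8-tagged string...
utf8_matched = !(/\u00B1/ =~ as_utf8).nil?

# ...but raises against the binary-tagged one.
binary_error = begin
  /\u00B1/ =~ as_binary
  nil
rescue Encoding::CompatibilityError => e
  e
end

p utf8_matched
p binary_error
```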
To Reproduce
Run a regex matching unicode.
Expected behavior
Fluentd should handle UTF-8 by default, though this could be a matter of taste.
Otherwise, document it (Have I missed it?) and provide an easy way to override it.
Your Environment
Your Configuration
Your Error Log
2025-03-06 10:25:32 +0000 [warn]: #0 fluent/log.rb:383:warn: pattern not matched: "Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780\uFFFD\uFFFDINFO\uFFFD\uFFFDABCD\uFFFD\uFFFDEFGH\uFFFD\uFFFDMYTAG\uFFFD\uFFFDHERE IS SOME MESSAGE\uFFFD\uFFFD"
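The paired `\uFFFD\uFFFD` in the warning would be consistent with each byte of a two-byte UTF-8 character being replaced separately, which is what Ruby does when a binary-tagged string is converted to UTF-8 with replacement enabled. This is a sketch in plain Ruby; whether Fluentd's logger takes this exact path is an assumption.

```ruby
plus_minus = "\u00B1" # U+00B1 PLUS-MINUS SIGN, UTF-8 literal

# In UTF-8 the character occupies two bytes.
utf8_bytes = plus_minus.bytes.map { |b| format("%02X", b) }
p utf8_bytes

# Tag the same bytes as ASCII-8BIT, then convert to UTF-8 with replacement.
# Neither high byte has a UTF-8 mapping from binary, so each one becomes
# U+FFFD (the replacement character).
binary = plus_minus.b
replaced = binary.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
replaced_codepoints = replaced.codepoints.map { |c| format("U+%04X", c) }
p replaced_codepoints
```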
Additional context
Here's a log line:
Here's my source:
Here's the error I get:
If I replace the expression with this, it goes through and is properly split:
\uFFFD is the Unicode replacement character ("unrecognized character").
My fluentd.conf is using UTF-8
My log file is using UTF-8
My config, as you can see above, is UTF-8
The only thing I see that is not UTF-8 is the ruby argument defining encoding.
So even when simplifying things a lot I don't see a proper handling of unicode.
I tried ±, \±, \uc2b1, \xc2\xb1
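A side note on the escapes tried above, checked in plain Ruby string literals (how Fluentd's config parser interprets escapes inside `expression` is a separate question and is not covered here): the plus-minus sign is code point U+00B1, and C2 B1 is its UTF-8 byte sequence. So `\xc2\xb1` spells the bytes, while `\uc2b1` names a different code point, U+C2B1.

```ruby
plus_minus = "±"

# Code point and UTF-8 bytes of the plus-minus sign.
codepoint = format("U+%04X", plus_minus.ord)
bytes     = plus_minus.bytes.map { |b| format("%02X", b) }
p codepoint
p bytes

# \u00B1 and \xC2\xB1 both denote the same character in a UTF-8 string...
same_as_u_escape = ("\u00B1" == plus_minus)
same_as_x_escape = ("\xC2\xB1" == plus_minus)

# ...whereas \uC2B1 is a different code point with different bytes.
other = "\uC2B1"
same_as_wrong_escape = (other == plus_minus)

p same_as_u_escape
p same_as_x_escape
p same_as_wrong_escape
```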