Issue parsing UTF-8 logs lines #4863

Open
pierreact opened this issue Mar 11, 2025 · 2 comments

Comments

@pierreact

Describe the bug

When parsing with a regexp, I couldn't match the plus-minus Unicode character (±).
I made sure that my config file, my log file and the source setting (encoding UTF-8) all use UTF-8.

All of these failed to match: ±, \uc2b1, \xc2\xb1
Out of desperation, I'm matching the "unknown character" (\uFFFD) for now.
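
For reference, the plus-minus sign is code point U+00B1, which UTF-8 encodes as the two bytes 0xC2 0xB1. The escape \uc2b1 therefore names a different character (U+C2B1), not ±. A small plain-Ruby check, independent of Fluentd (the variable names are only illustrative):

```ruby
# The plus-minus sign is code point U+00B1.
plus_minus = "\u00B1"

# Its UTF-8 encoding is the byte pair C2 B1.
utf8_bytes = plus_minus.bytes.map { |b| format("%02X", b) }

# "\uC2B1" is a different code point (U+C2B1), encoded as three bytes.
other = "\uC2B1"

# In a UTF-8 source file, "\xC2\xB1" is the same two bytes as the plus-minus sign.
from_bytes = "\xC2\xB1"

puts format("U+%04X -> %s", plus_minus.ord, utf8_bytes.join(" "))
puts "\\uC2B1 equals plus-minus? #{other == plus_minus}"
puts "\\xC2\\xB1 equals plus-minus? #{from_bytes == plus_minus}"
```

This only describes Ruby string literals; how each escape behaves inside a Fluentd regexp expression is not covered here.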

In the Fluentd Docker image, I enabled -vv to get more insight and discovered the following:

2025-02-28 10:42:12 +0000 [info]: fluent/log.rb:362:info: starting fluentd-1.18.0 pid=7 ruby="3.2.6"
2025-02-28 10:42:12 +0000 [info]: fluent/log.rb:362:info: spawn command to main:  cmdline=["/usr/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/bin/fluentd", "-c", "/fluentd/etc/fluentd.conf", "-vv", "--plugin", "/fluentd/plugins", "--under-supervisor"]

With fluent-package:

# pgrep -a ruby
772464 /opt/fluent/bin/ruby -Eascii-8bit:ascii-8bit /opt/fluent/bin/fluentd --log /var/log/fluent/fluentd.log --daemon /var/run/fluent/fluentd.pid --under-supervisor

Should something like -Eutf-8:utf-8 be defined instead, or could there be an easy way to override it?
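
For context, Ruby's -Eex:in option sets the default external and internal encodings, which decide how strings read from files and other IO are tagged when no encoding is given explicitly. The sketch below does not touch Fluentd; it only shows how the same bytes end up tagged as ASCII-8BIT or UTF-8 depending on the encoding requested at read time. The helper name and the sample line are made up for illustration.

```ruby
require "tempfile"

# Hypothetical helper: write the given text to a temp file, then read it back
# twice, once tagged as ASCII-8BIT and once tagged as UTF-8.
def read_both_ways(text)
  Tempfile.create("sample-log") do |file|
    file.binmode          # write the bytes exactly as they are
    file.write(text)
    file.flush
    [
      File.read(file.path, encoding: "ASCII-8BIT"),
      File.read(file.path, encoding: "UTF-8")
    ]
  end
end

as_binary, as_utf8 = read_both_ways("1739970007.466780\u00B1INFO\n")

puts "process default external: #{Encoding.default_external}"
puts "binary read: #{as_binary.encoding}, #{as_binary.length} characters"
puts "utf-8 read:  #{as_utf8.encoding}, #{as_utf8.length} characters"
puts "same bytes?  #{as_binary.bytes == as_utf8.bytes}"
```

The bytes are identical in both cases; only the encoding tag differs, and that tag is what regexp matching looks at.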

Thank you.

To Reproduce

Run a regexp parser whose expression contains a non-ASCII Unicode character (such as ±) against a UTF-8 log file.

Expected behavior

Fluentd should handle UTF-8 by default, although this may be a matter of taste.
Otherwise, please document it (have I missed it?) and provide an easy way to override it.

Your Environment

- Fluentd version: fluent-package 5.0.6-1 arm64 (Docker image was latest end of Feb).
- Operating system: Ubuntu 22.04
- Kernel version: 6.5.0

Your Configuration

# posfile is disabled on purpose here.
<source>
  @type tail
  path /var/log/somelogs/*.log

  tag sometag
  encoding UTF-8
  <parse>
    @type regexp
    expression /^(?<month>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<minute>\d+):(?<second>\d+)\s+(?<hostname>[^ ]*)\s+(?<processinfo>[^ ]*):\s+(?<timestamp_unix_and_us>[^ ]*)±(?<loglevel>[^ ]*)±(?<alphabetone>[^ ]*)±(?<alphabettwo>[^ ].*)±(?<tag>.*)±(?<message>.*)±(?<severity>[^ ]*)$/

    time_format %s.%N
    time_key timestamp_unix_and_us
    time_type string
    keep_time_key true

  </parse>
</source>

Your Error Log

2025-03-06 10:25:32 +0000 [warn]: #0 fluent/log.rb:383:warn: pattern not matched: "Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780\uFFFD\uFFFDINFO\uFFFD\uFFFDABCD\uFFFD\uFFFDEFGH\uFFFD\uFFFDMYTAG\uFFFD\uFFFDHERE IS SOME MESSAGE\uFFFD\uFFFD"

Additional context

Here's a log line:

Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780±INFO±ABCD±EFGH±MYTAG±HERE IS SOME MESSAGE±

My source and the error I get are the ones shown under "Your Configuration" and "Your Error Log" above.

If I replace the expression with this, it goes through and is properly split:

expression /^(?<month>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<minute>\d+):(?<second>\d+)\s+(?<hostname>[^ ]*)\s+(?<processinfo>[^ ]*):\s+(?<timestamp_unix_and_us>[^ \uFFFD]*)\uFFFD\uFFFD(?<loglevel>[^ \uFFFD]*)\uFFFD\uFFFD(?<alphabetone>[^ \uFFFD]*)\uFFFD\uFFFD(?<alphabettwo>[^ \uFFFD]*)\uFFFD\uFFFD(?<tag>[^\uFFFD]*)\uFFFD\uFFFD(?<message>[^\uFFFD]*)\uFFFD\uFFFD(?<severity>[^ ]*)$/

\uFFFD is the Unicode replacement character, shown in place of unrecognized characters.
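
The doubled \uFFFD in the warning is consistent with each of the two bytes of ± (0xC2, 0xB1) being replaced separately. The sketch below shows one way plain Ruby produces that pattern, namely converting an ASCII-8BIT string to UTF-8 with replacement. Whether Fluentd takes this exact path is an assumption, not something confirmed in this report.

```ruby
line = "466780\u00B1INFO"

# Same bytes, but tagged as ASCII-8BIT (binary).
binary = line.b

# Converting binary data to UTF-8: bytes >= 0x80 have no defined mapping,
# so each byte is replaced on its own (U+FFFD by default for Unicode targets).
replaced = binary.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)

# Re-tagging the bytes instead of converting them keeps the original character.
retagged = binary.dup.force_encoding(Encoding::UTF_8)

p replaced
p retagged == line
```

Under that assumption, a regexp containing ± can never match the converted string, while one containing \uFFFD\uFFFD can, which is what the workaround expression relies on.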

My fluentd.conf is in UTF-8.
My log file is in UTF-8.
My config, as you can see above, sets encoding UTF-8.
The only thing I see that is not UTF-8 is the Ruby argument defining the encoding.

So even after simplifying things a lot, I don't see Unicode being handled properly.

I tried ±, \±, \uc2b1 and \xc2\xb1.

@daipom
Contributor

daipom commented Mar 11, 2025

@pierreact Thanks for your report!
We will check this soon.

@daipom
Contributor

daipom commented Mar 12, 2025

@pierreact
We found that, at least per the Ruby specification, Encoding::CompatibilityError is raised if the encodings of the target data and the regular expression do not match.

In Fluentd, if we specify a regular expression that contains non-ASCII characters in the parser, this can be a problem.
But it looks like it should work fine since you have specified utf-8 for the encoding of in_tail.

We will investigate the correct use of regular expressions containing non-ASCII characters in the parser.

The following Ruby script shows the behavior:

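# Try to match target_data against a regexp built from regexp_string,
# printing the encodings involved and whether the match raised an error.
def test_match(target_data, regexp_string)
  puts "=== test_match ==="
  puts "target_data: #{target_data} (#{target_data.encoding})"
  puts "regexp_string: #{regexp_string} (#{regexp_string.encoding})"
  # Interpolating non-ASCII text gives the regexp the encoding of regexp_string.
  result = /#{regexp_string}/.match(target_data)
  puts "Success!"
  p result
rescue => e
  # Encoding::CompatibilityError is raised when the encodings are incompatible.
  puts "Failed!"
  p e
end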

test_match(
  "あいうえお".force_encoding(Encoding::UTF_8),
  "う".force_encoding(Encoding::UTF_8)
)
test_match(
  "あいうえお".force_encoding(Encoding::ASCII_8BIT),
  "う".force_encoding(Encoding::UTF_8)
)
test_match(
  "あいうえお".force_encoding(Encoding::UTF_8),
  "う".force_encoding(Encoding::ASCII_8BIT)
)
test_match(
  "あいうえお".force_encoding(Encoding::ASCII_8BIT),
  "う".force_encoding(Encoding::ASCII_8BIT)
)

Output:

=== test_match ===
target_data: あいうえお (UTF-8)
regexp_string: う (UTF-8)
Success!
#<MatchData "う">
=== test_match ===
target_data: あいうえお (ASCII-8BIT)
regexp_string: う (UTF-8)
Failed!
#<Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)>
=== test_match ===
target_data: あいうえお (UTF-8)
regexp_string: う (ASCII-8BIT)
Failed!
#<Encoding::CompatibilityError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)>
=== test_match ===
target_data: あいうえお (ASCII-8BIT)
regexp_string: う (ASCII-8BIT)
Success!
#<MatchData "\xE3\x81\x86">
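
Following the results above, one possible application-level workaround is to make sure both sides carry the same encoding before matching. This is only a sketch in plain Ruby, not Fluentd's parser code: match_as_utf8 is a hypothetical helper, and the pattern is a shortened stand-in for the expression in the report.

```ruby
# Hypothetical helper: re-tag both the data and the pattern source as UTF-8
# before matching, so that the two encodings agree.
def match_as_utf8(target_data, regexp_string)
  text = target_data.dup.force_encoding(Encoding::UTF_8)
  # Re-tagging does not validate the bytes, so check explicitly.
  return nil unless text.valid_encoding?

  pattern = regexp_string.dup.force_encoding(Encoding::UTF_8)
  Regexp.new(pattern).match(text)
end

# Both inputs start out tagged as ASCII-8BIT, as in the failing cases above.
line = "1739970007.466780\u00B1INFO\u00B1ABCD".b
expr = "^(?<time>[^\u00B1]*)\u00B1(?<loglevel>[^\u00B1]*)\u00B1".b

m = match_as_utf8(line, expr)
puts m[:time]
puts m[:loglevel]
```

Using force_encoding rather than encode keeps the bytes untouched, which is appropriate only when the data is known to be UTF-8 already; the valid_encoding? check guards against the case where it is not.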

@daipom daipom moved this from Triage to Work-In-Progress in Fluentd Kanban Mar 12, 2025