Issue parsing UTF-8 logs lines #4863

pierreact · 2025-03-11T09:32:11Z

Describe the bug

When parsing with a regexp, I couldn't match the plus-minus Unicode character (±).
I made sure that my config file, my log file and the encoding setting in my source (encoding UTF-8) all use UTF-8.

All of these failed to match: ±, \uc2b1, \xc2\xb1
In despair, I'm matching the "unknown character" (\uFFFD) for now.
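
For reference, the escapes tried above are not equivalent. The following is a minimal Ruby sketch, assuming a UTF-8 source file: ± is the code point U+00B1, which UTF-8 encodes as the two bytes C2 B1, whereas \uc2b1 names a different code point in the Hangul Syllables block. Whether \u00B1 would have matched in this Fluentd setup is not tested here.

```ruby
# encoding: utf-8
pm = "±"                                    # literal from a UTF-8 source file

puts pm.encoding                             # encoding of the literal
puts format("U+%04X", pm.ord)                # code point of the character
puts pm.bytes.map { |b| format("%02X", b) }.join(" ")  # its UTF-8 bytes

puts pm == "\u00B1"   # code point escape for the same character
puts pm == "\uC2B1"   # a different code point entirely
```

The byte-level form \xc2\xb1 and the code-point form \u00B1 describe the same character only when the pattern and the data are both treated as UTF-8.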

In the fluentd Docker image, I enabled -vv to get more insight and found the following:

2025-02-28 10:42:12 +0000 [info]: fluent/log.rb:362:info: starting fluentd-1.18.0 pid=7 ruby="3.2.6"
2025-02-28 10:42:12 +0000 [info]: fluent/log.rb:362:info: spawn command to main:  cmdline=["/usr/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/bin/fluentd", "-c", "/fluentd/etc/fluentd.conf", "-vv", "--plugin", "/fluentd/plugins", "--under-supervisor"]

With fluent-package:

# pgrep -a ruby
772464 /opt/fluent/bin/ruby -Eascii-8bit:ascii-8bit /opt/fluent/bin/fluentd --log /var/log/fluent/fluentd.log --daemon /var/run/fluent/fluentd.pid --under-supervisor

Should we have something like -Eutf-8:utf-8 defined instead, or maybe an easy way to override it?
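
For context, Ruby's -E ex[:in] switch sets the default external and internal encodings of the process. The sketch below spawns a child Ruby with the same switch seen in the process list and prints what it ends up with. It only illustrates the switch itself; it is an assumption, not something confirmed in this issue, that these defaults are what affects the regexp parser.

```ruby
require "rbconfig"

# One-liner executed by the child process: print both default encodings.
script = 'print [Encoding.default_external, Encoding.default_internal].map(&:to_s).join(" ")'

# Same -E switch as in the fluentd command line shown above.
out = IO.popen([RbConfig.ruby, "-Eascii-8bit:ascii-8bit", "-e", script], &:read)
puts out
```

Replacing the switch with -Eutf-8:utf-8 in this sketch shows the alternative defaults suggested above.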

Thank you.

To Reproduce

Run a regexp parser whose expression contains a non-ASCII Unicode character.

Expected behavior

It should handle UTF-8 by default, though this could be a matter of taste.
Otherwise, document it (have I missed it?) and provide an easy way to override it.

Your Environment

- Fluentd version: fluent-package 5.0.6-1 arm64 (Docker image was latest end of Feb).
- Operating system: Ubuntu 22.04
- Kernel version: 6.5.0

Your Configuration

# posfile is disabled on purpose here.
<source>
  @type tail
  path /var/log/somelogs/*.log

  tag sometag
  encoding UTF-8
  <parse>
    @type regexp
    expression /^(?<month>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<minute>\d+):(?<second>\d+)\s+(?<hostname>[^ ]*)\s+(?<processinfo>[^ ]*):\s+(?<timestamp_unix_and_us>[^ ]*)±(?<loglevel>[^ ]*)±(?<alphabetone>[^ ]*)±(?<alphabettwo>[^ ].*)±(?<tag>.*)±(?<message>.*)±(?<severity>[^ ]*)$/

    time_format %s.%N
    time_key timestamp_unix_and_us
    time_type string
    keep_time_key true

  </parse>
</source>

Your Error Log

2025-03-06 10:25:32 +0000 [warn]: #0 fluent/log.rb:383:warn: pattern not matched: "Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780\uFFFD\uFFFDINFO\uFFFD\uFFFDABCD\uFFFD\uFFFDEFGH\uFFFD\uFFFDMYTAG\uFFFD\uFFFDHERE IS SOME MESSAGE\uFFFD\uFFFD"

Additional context

Here's a log line:

Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780±INFO±ABCD±EFGH±MYTAG±HERE IS SOME MESSAGE±

Here's my source:

# posfile is disabled on purpose here.
<source>
  @type tail
  path /var/log/somelogs/*.log

  tag sometag
  encoding UTF-8
  <parse>
    @type regexp
    expression /^(?<month>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<minute>\d+):(?<second>\d+)\s+(?<hostname>[^ ]*)\s+(?<processinfo>[^ ]*):\s+(?<timestamp_unix_and_us>[^ ]*)±(?<loglevel>[^ ]*)±(?<alphabetone>[^ ]*)±(?<alphabettwo>[^ ].*)±(?<tag>.*)±(?<message>.*)±(?<severity>[^ ]*)$/

    time_format %s.%N
    time_key timestamp_unix_and_us
    time_type string
    keep_time_key true

  </parse>
</source>

Here's the error I get:

2025-03-06 10:25:32 +0000 [warn]: #0 fluent/log.rb:383:warn: pattern not matched: "Feb 19 13:00:07 SOMEHOSTNAME someprocess[605926]: 1739970007.466780\uFFFD\uFFFDINFO\uFFFD\uFFFDABCD\uFFFD\uFFFDEFGH\uFFFD\uFFFDMYTAG\uFFFD\uFFFDHERE IS SOME MESSAGE\uFFFD\uFFFD"

If I replace the expression with this, it goes through and is properly split:

expression /^(?<month>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<minute>\d+):(?<second>\d+)\s+(?<hostname>[^ ]*)\s+(?<processinfo>[^ ]*):\s+(?<timestamp_unix_and_us>[^ \uFFFD]*)\uFFFD\uFFFD(?<loglevel>[^ \uFFFD]*)\uFFFD\uFFFD(?<alphabetone>[^ \uFFFD]*)\uFFFD\uFFFD(?<alphabettwo>[^ \uFFFD]*)\uFFFD\uFFFD(?<tag>[^\uFFFD]*)\uFFFD\uFFFD(?<message>[^\uFFFD]*)\uFFFD\uFFFD(?<severity>[^ ]*)$/

\uFFFD is the Unicode replacement character (U+FFFD), which stands in for bytes that could not be decoded.
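
One way two replacement characters per ± can arise is sketched below. It assumes the line is held as ASCII-8BIT (binary) data and then transcoded to UTF-8 with replacement enabled; whether Fluentd actually takes this path is an assumption, not something confirmed here.

```ruby
# encoding: utf-8
raw = "466780±INFO".b   # same bytes as the UTF-8 text, labelled ASCII-8BIT

# Transcoding (not relabelling) binary data to UTF-8: every byte >= 0x80 has
# no defined mapping, so each of the two bytes of ± is replaced separately.
converted = raw.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)

puts converted.count("\uFFFD")   # replacement characters produced for one ±
```

This would be consistent with the \uFFFD\uFFFD pairs in the warning, since ± occupies two bytes in UTF-8.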

My fluentd.conf is using UTF-8.
My log file is using UTF-8.
My config, as you can see above, is set to UTF-8.
The only thing I see that is not UTF-8 is the Ruby argument defining the encoding.

So even when simplifying things a lot I don't see a proper handling of unicode.

I tried ±, \±, \uc2b1, \xc2\xb1


daipom · 2025-03-11T09:35:11Z

@pierreact Thanks for your report!
We will check this soon.

daipom · 2025-03-12T02:39:22Z

@pierreact
We found that, at least as far as Ruby's behavior goes, Encoding::CompatibilityError is raised if the encoding of the target data and that of the regular expression do not match.

In Fluentd, this can be a problem if we specify a regular expression that contains non-ASCII characters in the parser.
However, it looks like it should work in your case, since you have specified UTF-8 for the encoding of in_tail.

We will investigate the correct use of regular expressions containing non-ASCII characters in the parser.

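# Build a regexp from regexp_string, match it against target_data, and
# print the encodings involved along with the result or the raised error.
def test_match(target_data, regexp_string)
  puts "=== test_match ==="
  puts "target_data: #{target_data} (#{target_data.encoding})"
  puts "regexp_string: #{regexp_string} (#{regexp_string.encoding})"
  result = /#{regexp_string}/.match(target_data)
  puts "Success!"
  p result
rescue => e
  # Encoding::CompatibilityError ends up here when the encodings do not match.
  puts "Failed!"
  p e
end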

test_match(
  "あいうえお".force_encoding(Encoding::UTF_8),
  "う".force_encoding(Encoding::UTF_8)
)
test_match(
  "あいうえお".force_encoding(Encoding::ASCII_8BIT),
  "う".force_encoding(Encoding::UTF_8)
)
test_match(
  "あいうえお".force_encoding(Encoding::UTF_8),
  "う".force_encoding(Encoding::ASCII_8BIT)
)
test_match(
  "あいうえお".force_encoding(Encoding::ASCII_8BIT),
  "う".force_encoding(Encoding::ASCII_8BIT)
)

=== test_match ===
target_data: あいうえお (UTF-8)
regexp_string: う (UTF-8)
Success!
#<MatchData "う">
=== test_match ===
target_data: あいうえお (ASCII-8BIT)
regexp_string: う (UTF-8)
Failed!
#<Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)>
=== test_match ===
target_data: あいうえお (UTF-8)
regexp_string: う (ASCII-8BIT)
Failed!
#<Encoding::CompatibilityError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)>
=== test_match ===
target_data: あいうえお (ASCII-8BIT)
regexp_string: う (ASCII-8BIT)
Success!
#<MatchData "\xE3\x81\x86">
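
Building on the results above, here is a minimal sketch of one way to avoid the mismatch: relabel the data with the pattern's encoding before matching. The helper name match_aligned is hypothetical and is not part of Fluentd; the sketch assumes the bytes really are valid in the pattern's encoding.

```ruby
# encoding: utf-8

# Hypothetical helper (not a Fluentd API): copy the data and tag the copy with
# the pattern's encoding. force_encoding changes only the label, not the bytes.
def match_aligned(pattern_source, data)
  aligned = data.dup.force_encoding(pattern_source.encoding)
  Regexp.new(pattern_source).match(aligned)
end

line = "1739970007.466780±INFO±ABCD".b   # bytes as they would look when read as binary
m = match_aligned("(?<ts>[^±]*)±(?<level>[^±]*)±", line)

puts m[:ts]
puts m[:level]
```

The opposite direction, turning the pattern into binary with String#b, corresponds to the fourth case above, where both sides are ASCII-8BIT.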

pierreact added the waiting-for-triage label Mar 11, 2025

github-project-automation bot added this to Fluentd Kanban Mar 11, 2025

pierreact mentioned this issue Mar 11, 2025

8bit ascii is used. fluent/fluentd-docker-image#426

Closed

daipom moved this to Triage in Fluentd Kanban Mar 11, 2025

daipom added work-in-progress and removed waiting-for-triage labels Mar 12, 2025

daipom moved this from Triage to Work-In-Progress in Fluentd Kanban Mar 12, 2025
