Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgraded strings yield wrong results. #36

Open
FGasper opened this issue Feb 26, 2021 · 6 comments
Open

Upgraded strings yield wrong results. #36

FGasper opened this issue Feb 26, 2021 · 6 comments

Comments

@FGasper
Copy link

FGasper commented Feb 26, 2021

use strict;
use warnings;

use LMDB_File qw(:envflags :cursor_op);

use File::Temp;
my $dir = File::Temp::tempdir( CLEANUP => 1 );

my $d = "é";
my $u = "ü";
utf8::upgrade $u;

my $path = "$dir/file";

{
    my %hash;
    my $db = tie %hash, 'LMDB_File', "$dir/file", MDB_NOSUBDIR;

    $hash{$d} = $d;
    $hash{$u} = $u;
}

use Data::Dumper;
{
    my %hash;
    my $db = tie %hash, 'LMDB_File', "$dir/file", MDB_NOSUBDIR;

    print Dumper \%hash;
}

… prints:

$VAR1 = {
          'ü' => 'ü',
          'é' => 'é'
        };

The UTF8 option toggles between SvPV and SvPVutf8; it should actually toggle between SvPVbyte and SvPVutf8 to avoid a spurious dependency on Perl’s internal string storage logic.

@hoytech
Copy link
Contributor

hoytech commented Feb 26, 2021

Thanks! Any chance you could send a pull request? I'm sure @salortiz would accept it.

@salortiz
Copy link
Owner

salortiz commented Feb 26, 2021

@FGasper,

As your code shows, with MLDB_File what you put in is what you get out, so if you need any particular encoding your code should use the standard Encodemodule. For the already complex Perl's string handling the module don't attempts to be cleaver than the user.

Only If your code use a proper UTF-8 environment (via use utf8;, use feature 'unicode_strings' and use open ':std', IO => ..., etc.) you can enable the automatic handling setting $DB->UTF8(1).

And in LMDB_File's UTF8 mode a lexical use bytes pragma can be used as a escape hatch if you known what are you doing, YMMV.

@FGasper
Copy link
Author

FGasper commented Feb 27, 2021

@salortiz

My code was admittedly contrived—production code doesn’t normally call utf8::upgrade—but given Perl’s internal encoding abstraction, a Perl application doesn’t actually know “what it puts in” in terms of the PV itself.

Here are a couple more feasible ways of currently arriving at upgraded strings:

  • my $foo = "é"; $foo .= "\x{100}"; chop $foo
  • round-tripping through JSON (e.g., to duplicate a deep structure)

Of course, those won’t necessarily be true for any given Perl version, and that’s the point of the abstraction: things that aren’t the Perl interpreter shouldn’t care how Perl internally stores its strings. If two SVs eq each other in Perl, then the application should reasonably expect that those strings will present outside Perl the same way—as is the case with, e.g., writing to a filehandle as well as modules like MIME::Base64 and JSON::XS.

Encode doesn’t always (currently) behave as you’d think regarding the SV internals; for example, `decode('Latin-1') outputs upgraded strings, despite the fact that “decoding Latin-1” is basically a no-op.

Perl’s string handling isn’t all that complex; it’s just a bit buggy and weirdly documented. But the idea is that two strings are equal if they store the same code points, regardless of whether Perl stores them with the same internal encoding. SvPV violates that and is thus nearly always the wrong choice unless whatever uses it also checks SvUTF8.

Calling SvPVbyte would thus actually be less clever than calling SvPV since SvPVbyte always yields a consistent result regardless of Perl’s internals.

@FGasper
Copy link
Author

FGasper commented Feb 27, 2021

@salortiz I guess you’re in the same boat as CDB_File … this problem can’t really be fixed without potentially breaking existing applications.

We can work around this problem by utf8::downgrade()ing everything before sending it in to LMDB_File. Alternatively, would you be amenable to adding a ->bytes($bool) that does the SvPVbyte behaviour? I could implement that if you want.

@salortiz
Copy link
Owner

salortiz commented Mar 1, 2021

@FGasper,

I think that Perl (and SvPVs in particular) violates your expectations. But that don't means that Perl nor LMDB_File are buggy.

The module assumes that the user known what its data contains.

Your affirmation that

… the fact that “decoding Latin-1” is basically a no-op.

is plain false, the result don't even have the same length:

use v5.12; # Need 'say'
use Encode;
my $decoded = decode("latin1", my $l1acute = "\x{e1}");
say bytes::length $l1acute; # say "1"
say bytes::length $decoded; # say "2"

In your original example I don't known what do you expect to be in $u after utf8::upgrade($u).
But perl is clear that you has four bytes (certainly not an "ü" in any encoding), please run:

perl -E 'my $u = "ü"; utf8::upgrade($u); use bytes; say bytes::length($u)'

(You can fix your expectations adding a use utf8 telling perl that "ü" is already in UTF-8 so utf8::upgrade becomes a no-op)

As both utf8::upgrade and utf8::downgrade can potentially modify the content of the user's SV (depending of the current state of the UTF8 flag) and that is bad practice your idea of an unconditional utf8::downgrade cannot be considered, sorry.

If you think that $DB->bytes($bool) is needed, please make a PR.

@FGasper
Copy link
Author

FGasper commented Mar 1, 2021

@salortiz You’re using bytes.pm; that module’s own documentation emphatically says to avoid it except for debugging purposes because it breaks Perl’s string encapsulation.

utf8::upgrade does not change how $u generally behaves in Perl. It’s the same set of code points as before the upgrade; Perl just stores those code points differently. You can verify this by eq-comparing a string and its upgraded or downgraded equivalent. You can also print or say such strings, and the upgrade/downgrade won’t make a difference:

my $foo = "é";
printf "$foo (len: %d) $/", length $foo;
utf8::upgrade($foo);
printf "$foo (len: %d) $/", length $foo;

Thus, in theory it really shouldn’t matter if a module upgrades or downgrades a passed-in scalar. That said, I did follow your precedent of avoiding changing the passed-in scalar in my PR, for what that’s worth.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants