
[#6561] improvement(core): Cache Hadoop Filesystem instance on Gravitino server to improve the performance #6619

Open
wants to merge 1 commit into base: main

Conversation

sunxiaojian
Contributor

@sunxiaojian sunxiaojian commented Mar 6, 2025

What changes were proposed in this pull request?

Cache Hadoop Filesystem instance on Gravitino server to improve the performance

Why are the changes needed?

Fix: #6561

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

N/A

@jerqi jerqi changed the title from "[#6561] improvement(core)Cache Hadoop Filesystem instance on Gravitino server to improve the performance" to "[#6561] improvement(core): Cache Hadoop Filesystem instance on Gravitino server to improve the performance" on Mar 6, 2025
@sunxiaojian
Contributor Author

@yuqi1129 PTAL, thanks

@yuqi1129
Contributor

yuqi1129 commented Mar 8, 2025

I will take some time to verify whether this works on cloud filesystems, as the related ITs are not triggered by default.

   * @return The FileSystem instance.
   * @throws IOException If the FileSystem instance cannot be created.
   */
  default FileSystem getFileSystem(
      @Nonnull Path path, @Nonnull Map<String, String> config, boolean disableCache)
      throws IOException {
    // disable cache
    config.put(String.format("fs.%s.impl.disable.cache", scheme()), String.valueOf(disableCache));

Contributor

Will this method only be used by the Gravitino server?

Contributor

What if the config already contains the key fs.%s.impl.disable.cache? Will we overwrite it?

Contributor Author

It will be overwritten. I think we should prioritize the disableCache parameter.
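
A minimal sketch of the overwrite behavior discussed in this thread; the "s3a" scheme and the hand-built config map are illustrative assumptions, not taken from the PR:

```java
import java.util.HashMap;
import java.util.Map;

public class DisableCacheOverrideSketch {
  public static void main(String[] args) {
    // Hypothetical caller-supplied config that already sets the key for the "s3a" scheme.
    Map<String, String> config = new HashMap<>();
    config.put("fs.s3a.impl.disable.cache", "false");

    // What the provider does inside getFileSystem(..., disableCache = true):
    boolean disableCache = true;
    config.put(String.format("fs.%s.impl.disable.cache", "s3a"), String.valueOf(disableCache));

    // The explicit disableCache argument overwrites the caller's value.
    System.out.println(config.get("fs.s3a.impl.disable.cache")); // prints "true"
  }
}
```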

@sunxiaojian
Contributor Author

> I will take some time to verify whether this works on cloud filesystems, as the related ITs are not triggered by default.

ok

@yuqi1129 yuqi1129 requested a review from xloya March 10, 2025 03:31
@yuqi1129
Contributor

@xloya @jerryshao I don't see any issues. Can you take a look?

@xloya
Contributor

xloya commented Mar 10, 2025

If I understand correctly, the current version of FileSystem Provider will also be used for Hadoop GVFS. I think using Hadoop's own cache here will cause security issues, because the authenticated FileSystem is no longer encapsulated by GVFS, and others can arbitrarily obtain the authenticated FileSystem through FileSystem.get() and perform unauthorized operations.
If you just want to add a filesystem cache on the server side, I think you can provide a cache class. Otherwise, we need to ensure that the client cannot get the authenticated filesystem at will.
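
As a rough illustration of this concern (a sketch under assumptions: the s3a scheme, credential keys, and bucket are placeholders; whether both calls return the same object depends on Hadoop's cache key, which is roughly scheme + authority + current user):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SharedCacheConcernSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical credentials injected for an authenticated user.
    conf.set("fs.s3a.access.key", "authenticated-user-key");
    conf.set("fs.s3a.secret.key", "authenticated-user-secret");

    // With Hadoop's built-in cache enabled, this instance lands in the JVM-wide cache.
    FileSystem authenticated = FileSystem.get(URI.create("s3a://some-bucket/"), conf);

    // Other code in the same JVM can then fetch the same authenticated instance with a
    // plain Configuration, bypassing the GVFS layer, because the cache key ignores most
    // of the configuration.
    FileSystem leaked = FileSystem.get(URI.create("s3a://some-bucket/"), new Configuration());
    System.out.println(authenticated == leaked); // likely true when the cache is shared
  }
}
```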

@sunxiaojian
Contributor Author

> If I understand correctly, the current version of FileSystem Provider will also be used for Hadoop GVFS. I think using Hadoop's own cache here will cause security issues, because the authenticated FileSystem is no longer encapsulated by GVFS, and others can arbitrarily obtain the authenticated FileSystem through FileSystem.get() and perform unauthorized operations. If you just want to add a filesystem cache on the server side, I think you can provide a cache class. Otherwise, we need to ensure that the client cannot get the authenticated filesystem at will.

Thanks, let me check it again.

@yuqi1129
Contributor

> If I understand correctly, the current version of FileSystem Provider will also be used for Hadoop GVFS. I think using Hadoop's own cache here will cause security issues, because the authenticated FileSystem is no longer encapsulated by GVFS, and others can arbitrarily obtain the authenticated FileSystem through FileSystem.get() and perform unauthorized operations. If you just want to add a filesystem cache on the server side, I think you can provide a cache class. Otherwise, we need to ensure that the client cannot get the authenticated filesystem at will.

On the Gravitino server side, the fs cache is only used by the Gravitino server, so why can't we use the FileSystem cache? @xloya, could you clarify this further?

In the GVFS client, indeed, there would be a security vulnerability if the fs cache were enabled.

@sunxiaojian
Contributor Author

The GVFS side uses its own internally maintained cache, and when GVFS initializes a FileSystem it forces the Hadoop FS cache to be disabled.
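
For reference, a minimal sketch of what forcing the Hadoop FS cache off at initialization can look like; the s3a scheme is only an example, and the actual GVFS initialization code is not shown in this thread:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ClientDisableHadoopCacheSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Per-scheme switch that tells Hadoop not to put/get this FileSystem in its JVM-wide cache.
    conf.setBoolean("fs.s3a.impl.disable.cache", true);

    // With the cache disabled, FileSystem.get() builds a fresh instance instead of consulting
    // the shared cache, so an authenticated instance cannot be picked up by unrelated code.
    FileSystem fs = FileSystem.get(URI.create("s3a://some-bucket/"), conf);
    System.out.println(fs.getUri());
    fs.close();
  }
}
```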

@xloya
Contributor

xloya commented Mar 13, 2025

> > If I understand correctly, the current version of FileSystem Provider will also be used for Hadoop GVFS. I think using Hadoop's own cache here will cause security issues, because the authenticated FileSystem is no longer encapsulated by GVFS, and others can arbitrarily obtain the authenticated FileSystem through FileSystem.get() and perform unauthorized operations. If you just want to add a filesystem cache on the server side, I think you can provide a cache class. Otherwise, we need to ensure that the client cannot get the authenticated filesystem at will.
>
> On the Gravitino server side, the fs cache is only used by the Gravitino server, so why can't we use the FileSystem cache? @xloya, could you clarify this further?
>
> In the GVFS client, indeed, there would be a security vulnerability if the fs cache were enabled.

What I mean is that we could enable the Hadoop FileSystem cache on the server side, but the current FileSystem Provider is not designed only for the server, it is also used by the client. I think we need to consider how to enable the Hadoop FileSystem cache only on the server, and not on the client.
In addition, the Hadoop GVFS client already has an internal FileSystem cache that is independent of the Hadoop FileSystem cache, so there is no need to use FileSystem.get() to cache the FileSystem to improve performance. Using FileSystem.get() in the Hadoop GVFS client would cause security issues instead.
From this, I think there are two ways to solve the problem:

  1. Keep the current way of using FileSystem Provider on both the client and the server, and still use FileSystem.newInstance() to always create a new FileSystem. At the same time, add an internal FileSystem cache on the server, just like the implementation of Hadoop GVFS (a sketch follows below).
  2. Consider splitting or modifying the logic of FileSystem Provider to support different logic on the client and the server, to avoid the security issues of the client's Hadoop FileSystem cache.
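
A minimal sketch of option 1, a server-side cache placed in front of the provider. Caffeine, the class name, the key format, and the newFileSystem hook are all illustrative assumptions, not the actual implementation:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.RemovalCause;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.time.Duration;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical server-side cache; key format and eviction policy are placeholders. */
public class ServerFileSystemCache {

  private final Cache<String, FileSystem> cache =
      Caffeine.newBuilder()
          .expireAfterAccess(Duration.ofHours(1))
          // Close evicted instances so connections are not leaked.
          .removalListener(
              (String key, FileSystem fs, RemovalCause cause) -> {
                try {
                  if (fs != null) {
                    fs.close();
                  }
                } catch (IOException ignored) {
                  // best effort
                }
              })
          .build();

  /** Reuses one FileSystem per (scheme, config) pair instead of creating a new one per request. */
  public FileSystem getOrCreate(String scheme, Path path, Map<String, String> config) {
    // Placeholder key; a real key should be a stable digest of the scheme plus relevant config.
    String key = scheme + "|" + config.hashCode();
    return cache.get(
        key,
        k -> {
          try {
            // Here the existing FileSystemProvider#getFileSystem(...) would be called,
            // still using FileSystem.newInstance() underneath.
            return newFileSystem(path, config);
          } catch (IOException e) {
            throw new UncheckedIOException(e);
          }
        });
  }

  private FileSystem newFileSystem(Path path, Map<String, String> config) throws IOException {
    throw new UnsupportedOperationException("placeholder for the provider call");
  }
}
```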
