
[#6561] improvement(core): Cache Hadoop Filesystem instance on Gravitino server to improve the performance #6619

Open
wants to merge 1 commit into base: main

Conversation

sunxiaojian
Contributor

@sunxiaojian sunxiaojian commented Mar 6, 2025

What changes were proposed in this pull request?

Cache Hadoop Filesystem instance on Gravitino server to improve the performance

Why are the changes needed?

Fix: #6561

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

N/A

@jerqi jerqi changed the title from "[#6561] improvement(core)Cache Hadoop Filesystem instance on Gravitino server to improve the performance" to "[#6561] improvement(core): Cache Hadoop Filesystem instance on Gravitino server to improve the performance" on Mar 6, 2025
@sunxiaojian
Contributor Author

@yuqi1129 PTAL, thanks

@yuqi1129
Contributor

yuqi1129 commented Mar 8, 2025

I will take some time to verify whether this works on cloud filesystems, as the related ITs are not triggered by default.

   * @return The FileSystem instance.
   * @throws IOException If the FileSystem instance cannot be created.
   */
  default FileSystem getFileSystem(
      @Nonnull Path path, @Nonnull Map<String, String> config, boolean disableCache)
      throws IOException {
    // disable cache
    config.put(String.format("fs.%s.impl.disable.cache", scheme()), String.valueOf(disableCache));

Contributor

Will this method only be used by the Gravitino server?

Contributor

What if the config already contains the key fs.%s.impl.disable.cache? Will we overwrite it?

Contributor Author

It will be overwritten. I think we should prioritize the disableCache parameter.
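
A minimal sketch of the overwrite behavior discussed in this thread; the "s3a" scheme and the hand-built config map are illustrative assumptions, not taken from the PR:

```java
import java.util.HashMap;
import java.util.Map;

public class DisableCacheOverrideSketch {
  public static void main(String[] args) {
    // Hypothetical caller-supplied config that already sets the key for the "s3a" scheme.
    Map<String, String> config = new HashMap<>();
    config.put("fs.s3a.impl.disable.cache", "false");

    // What the provider does inside getFileSystem(..., disableCache = true):
    boolean disableCache = true;
    config.put(String.format("fs.%s.impl.disable.cache", "s3a"), String.valueOf(disableCache));

    // The explicit disableCache argument overwrites the caller's value.
    System.out.println(config.get("fs.s3a.impl.disable.cache")); // prints "true"
  }
}
```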

@sunxiaojian
Contributor Author

> I will take some time to verify whether this works on cloud filesystems, as the related ITs are not triggered by default.

ok

@yuqi1129 yuqi1129 requested a review from xloya March 10, 2025 03:31
@yuqi1129
Contributor

@xloya @jerryshao I don't see any issues. Can you take a look?

@xloya
Contributor

xloya commented Mar 10, 2025

If I understand correctly, the current version of FileSystem Provider will also be used for Hadoop GVFS. I think using Hadoop's own cache here will cause security issues, because the authenticated FileSystem is no longer encapsulated by GVFS, and others can arbitrarily obtain the authenticated FileSystem through FileSystem.get() and perform unauthorized operations.
If you just want to add a filesystem cache on the server side, I think you can provide a cache class. Otherwise, we need to ensure that the client cannot get the authenticated filesystem at will.
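
As a rough illustration of this concern (a sketch under assumptions: the s3a scheme, credential keys, and bucket are placeholders; whether both calls return the same object depends on Hadoop's cache key, which is roughly scheme + authority + current user):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SharedCacheConcernSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical credentials injected for an authenticated user.
    conf.set("fs.s3a.access.key", "authenticated-user-key");
    conf.set("fs.s3a.secret.key", "authenticated-user-secret");

    // With Hadoop's built-in cache enabled, this instance lands in the JVM-wide cache.
    FileSystem authenticated = FileSystem.get(URI.create("s3a://some-bucket/"), conf);

    // Other code in the same JVM can then fetch the same authenticated instance with a
    // plain Configuration, bypassing the GVFS layer, because the cache key ignores most
    // of the configuration.
    FileSystem leaked = FileSystem.get(URI.create("s3a://some-bucket/"), new Configuration());
    System.out.println(authenticated == leaked); // likely true when the cache is shared
  }
}
```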

@sunxiaojian
Contributor Author

> If I understand correctly, the current version of FileSystem Provider will also be used for Hadoop GVFS. I think using Hadoop's own cache here will cause security issues, because the authenticated FileSystem is no longer encapsulated by GVFS, and others can arbitrarily obtain the authenticated FileSystem through FileSystem.get() and perform unauthorized operations. If you just want to add a filesystem cache on the server side, I think you can provide a cache class. Otherwise, we need to ensure that the client cannot get the authenticated filesystem at will.

Thanks, let me check it again.

@yuqi1129
Contributor

> If I understand correctly, the current version of FileSystem Provider will also be used for Hadoop GVFS. I think using Hadoop's own cache here will cause security issues, because the authenticated FileSystem is no longer encapsulated by GVFS, and others can arbitrarily obtain the authenticated FileSystem through FileSystem.get() and perform unauthorized operations. If you just want to add a filesystem cache on the server side, I think you can provide a cache class. Otherwise, we need to ensure that the client cannot get the authenticated filesystem at will.

On the Gravitino server side, the fs cache is only used by the Gravitino server, so why can't we use the FileSystem cache? @xloya, could you clarify this further?

In the GVFS client, indeed, there would be a security vulnerability if the fs cache were enabled.

@sunxiaojian
Contributor Author

The GVFS side uses its own internally maintained cache, and when GVFS initializes a FileSystem it forces the Hadoop FS cache to be disabled.
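
For reference, a minimal sketch of what forcing the Hadoop FS cache off at initialization can look like; the s3a scheme is only an example, and the actual GVFS initialization code is not shown in this thread:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ClientDisableHadoopCacheSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Per-scheme switch that tells Hadoop not to put/get this FileSystem in its JVM-wide cache.
    conf.setBoolean("fs.s3a.impl.disable.cache", true);

    // With the cache disabled, FileSystem.get() builds a fresh instance instead of consulting
    // the shared cache, so an authenticated instance cannot be picked up by unrelated code.
    FileSystem fs = FileSystem.get(URI.create("s3a://some-bucket/"), conf);
    System.out.println(fs.getUri());
    fs.close();
  }
}
```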

@xloya
Contributor

xloya commented Mar 13, 2025

> > If I understand correctly, the current version of FileSystem Provider will also be used for Hadoop GVFS. I think using Hadoop's own cache here will cause security issues, because the authenticated FileSystem is no longer encapsulated by GVFS, and others can arbitrarily obtain the authenticated FileSystem through FileSystem.get() and perform unauthorized operations. If you just want to add a filesystem cache on the server side, I think you can provide a cache class. Otherwise, we need to ensure that the client cannot get the authenticated filesystem at will.
>
> On the Gravitino server side, the fs cache is only used by the Gravitino server, so why can't we use the FileSystem cache? @xloya, could you clarify this further?
>
> In the GVFS client, indeed, there would be a security vulnerability if the fs cache were enabled.

What I mean is that we could enable the Hadoop FileSystem cache on the server side, but the current FileSystem Provider is not designed only for the server, it is also used by the client. I think we need to consider how to enable the Hadoop FileSystem cache only on the server, and not on the client.
In addition, the Hadoop GVFS client already has an internal FileSystem cache that is independent of the Hadoop FileSystem cache, so there is no need to use FileSystem.get() to cache the FileSystem to improve performance. Using FileSystem.get() in the Hadoop GVFS client would cause security issues instead.
From this, I think there are two ways to solve the problem:

  1. Keep the current way of using FileSystem Provider on both the client and the server, and still use FileSystem.newInstance() to always create a new FileSystem. At the same time, add an internal FileSystem cache on the server, just like the implementation of Hadoop GVFS (a sketch follows below).
  2. Consider splitting or modifying the logic of FileSystem Provider to support different logic on the client and the server, to avoid the security issues of the client's Hadoop FileSystem cache.
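
A minimal sketch of option 1, a server-side cache placed in front of the provider. Caffeine, the class name, the key format, and the newFileSystem hook are all illustrative assumptions, not the actual implementation:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.RemovalCause;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.time.Duration;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical server-side cache; key format and eviction policy are placeholders. */
public class ServerFileSystemCache {

  private final Cache<String, FileSystem> cache =
      Caffeine.newBuilder()
          .expireAfterAccess(Duration.ofHours(1))
          // Close evicted instances so connections are not leaked.
          .removalListener(
              (String key, FileSystem fs, RemovalCause cause) -> {
                try {
                  if (fs != null) {
                    fs.close();
                  }
                } catch (IOException ignored) {
                  // best effort
                }
              })
          .build();

  /** Reuses one FileSystem per (scheme, config) pair instead of creating a new one per request. */
  public FileSystem getOrCreate(String scheme, Path path, Map<String, String> config) {
    // Placeholder key; a real key should be a stable digest of the scheme plus relevant config.
    String key = scheme + "|" + config.hashCode();
    return cache.get(
        key,
        k -> {
          try {
            // Here the existing FileSystemProvider#getFileSystem(...) would be called,
            // still using FileSystem.newInstance() underneath.
            return newFileSystem(path, config);
          } catch (IOException e) {
            throw new UncheckedIOException(e);
          }
        });
  }

  private FileSystem newFileSystem(Path path, Map<String, String> config) throws IOException {
    throw new UnsupportedOperationException("placeholder for the provider call");
  }
}
```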
