
feat: sync llama.cpp #110

Merged: 6 commits into mybigday:main on Jan 24, 2025

Conversation

a-ghorbani (Contributor) commented Jan 21, 2025

Vali-98 (Contributor) commented Jan 21, 2025

Hey there, I was also going to bump the val[2048] size to [4096] in PR #111, to support the newer DeepSeek R1 prompt format, which exceeds the old buffer size.

I wanted to ask if there is a better method of loading the model metadata, as reserving 4096 bytes for what is often a 2–4 byte float/uint seems somewhat wasteful. Ultimately it's a very small optimization in the grand scheme of things, but it would be nice not to reserve memory unnecessarily.
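
A minimal sketch of what a size-aware read could look like instead of a fixed buffer, assuming the upstream llama_model_meta_val_str API and its snprintf-style return value (the full length of the value, even when the buffer passed in is too small):

    // Sketch only: assumes llama_model_meta_val_str() keeps snprintf semantics,
    // i.e. it returns the full value length even when buf is too small.
    #include <cstdint>
    #include <string>
    #include <vector>
    #include "llama.h"

    static std::string get_meta_val(const llama_model * model, const char * key) {
        std::vector<char> buf(128); // small default instead of a fixed 4096-byte array
        int32_t len = llama_model_meta_val_str(model, key, buf.data(), buf.size());
        if (len < 0) {
            return ""; // key not present in the GGUF metadata
        }
        if ((size_t) len >= buf.size()) {
            buf.resize(len + 1); // grow only for the rare long values
            llama_model_meta_val_str(model, key, buf.data(), buf.size());
        }
        return std::string(buf.data(), len);
    }

With something like this, only genuinely long values (the chat template being the usual offender) would ever trigger a larger allocation.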

a-ghorbani (Contributor, Author)

Yes, agreed. I didn't spend time on this since increasing the buffer was a quick, though not optimal, solution. If you have a way to optimize it, that would be nice.

a-ghorbani (Contributor, Author)

@Vali-98 I am not getting consistent responses for DeepSeek R1. Testing on unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf.

The output from the llama.cpp (aea8ddd5) CLI seems very sensible and consistent:

> Hi
<think>
Okay, the user just said "Hi". That's pretty straightforward. I should respond in a friendly and welcoming way.

Maybe I can ask them how they're doing or if there's something specific they need help with.

Keeping it simple and open-ended seems like the best approach.
</think>

Hello! How can I assist you today? If you have any questions or need help with something, feel free to ask!

> what is 1+1?
<think>
Alright, the user just asked "what is 1+1?" which is a basic math problem. I need to provide a clear and correct answer.

I should respond by stating the result of the addition and maybe add a fun fact to keep it light and engaging. That should cover it.
</think>

1+1 equals **2**. That's a fundamental fact in mathematics!

> how about 8/2?
<think>
Okay, the user just asked "how about 8/2?" which is a division problem. I should respond in a friendly and helpful manner.

I'll provide the answer and maybe add a fun fact to make it more interesting. That should cover it.
</think>

8 divided by 2 is **4**. That's a simple division fact! I hope that's helpful! 😊

From PocketPal (using the same model and llama.cpp backend version):
[screenshot: Screenshot_20250121-154531 (1)]

From ChatterUI (although I didn't change any settings; I'm not sure if I needed to apply any):
[screenshot: Screenshot_20250122-155536]

Are you getting good results with any of the DeepSeek-R1 distills?

Vali-98 (Contributor) commented Jan 22, 2025

Hey there, it's also not working on my end.

I decided to test bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF to check whether this was a model conversion error on unsloth's part; unfortunately it isn't. Testing Q4_0 also resulted in gibberish output. I'll see if I have time to investigate this thoroughly later.

I had only actually tested the 8B distill, which works flawlessly.

a-ghorbani marked this pull request as ready for review on January 22, 2025, 21:00
jhen0409 (Member)

> I am not getting consistent responses for DeepSeek R1. Testing on unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf.

If I just build the lib with -march=armv8-a, the Q8 model works fine, so I guess the issue may be caused by some CPU features.

a-ghorbani (Contributor, Author)

> If I just build the lib with -march=armv8-a, the Q8 model works fine, so I guess the issue may be caused by some CPU features.

Good catch! Indeed, using -march=armv8-a works much better (see the screenshot below).

> I guess the issue may be caused by some CPU features

If that's the case, could this be an issue on the llama.cpp side?

[screenshot: Screenshot_20250123_092333]

a-ghorbani (Contributor, Author)

Just wanted to give an update in case you're looking into this. The culprit seems to be +fp16 for Android.

Perhaps we could set up the compilation with something like this:

    build_library("rnllama_v8" "-march=armv8-a")
    build_library("rnllama_v8_2" "-march=armv8.2-a")
    build_library("rnllama_v8_2_dotprod" "-march=armv8.2-a+dotprod")
    build_library("rnllama_v8_2_i8mm" "-march=armv8.2-a+i8mm")
    build_library("rnllama_v8_2_dotprod_i8mm" "-march=armv8.2-a+dotprod+i8mm")

and

      // Pick the most capable build the device supports, falling back to plain armv8-a.
      // hasDotProd / hasI8mm / hasFp16 come from CPU feature detection done before loading.
      if (hasDotProd && hasI8mm) {
        Log.d(NAME, "Loading librnllama_v8_2_dotprod_i8mm.so");
        System.loadLibrary("rnllama_v8_2_dotprod_i8mm");
      } else if (hasDotProd) {
        Log.d(NAME, "Loading librnllama_v8_2_dotprod.so");
        System.loadLibrary("rnllama_v8_2_dotprod");
      } else if (hasI8mm) {
        Log.d(NAME, "Loading librnllama_v8_2_i8mm.so");
        System.loadLibrary("rnllama_v8_2_i8mm");
      } else if (hasFp16) {
        Log.d(NAME, "Loading librnllama_v8_2.so");
        System.loadLibrary("rnllama_v8_2");
      } else {
        Log.d(NAME, "Loading default librnllama_v8.so");
        System.loadLibrary("rnllama_v8");
      }

With this config, I was able to run the setup successfully on the following devices:

  • OnePlus 6 (loads librnllama_v8_2.so)
  • Pixel 9 (loads librnllama_v8_2_dotprod_i8mm.so)
  • Emulator (loads librnllama_v8_2_dotprod.so)

All worked without any obvious issues.

I am no expert in these compiler settings, but since what matters most for ggml/llama.cpp are i8mm and dotprod, we should be good?

I'll run a few more tests, and if successful, we could use these settings.

But this obviously won't help with the issue on iOS.

Vali-98 (Contributor) commented Jan 23, 2025

Just as a quick check: IIRC, many features are now detected at runtime within llama.cpp, using lm_ggml_cpu_has_neon, lm_ggml_cpu_has_dotprod, and lm_ggml_cpu_has_matmul_int8 to check for NEON/i8mm compatibility.

Is it possible to now collapse all builds into rnllama_v8_2_dotprod_i8mm (-march=armv8.2-a+dotprod+i8mm) to cover all ARM SoCs, and see if older devices would still work?
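
For reference, a minimal sketch of that runtime query on the native side, using the upstream function names (this repo prefixes them with lm_, and the ggml-cpu.h header is an assumption about where they are declared):

    // Sketch: query ggml's runtime CPU-feature detection after the library loads.
    #include <cstdio>
    #include "ggml-cpu.h" // assumed location of the ggml_cpu_has_* declarations

    int main() {
        std::printf("neon:        %d\n", ggml_cpu_has_neon());
        std::printf("dotprod:     %d\n", ggml_cpu_has_dotprod());
        std::printf("int8 matmul: %d\n", ggml_cpu_has_matmul_int8());
        return 0;
    }

The limitation is that these checks only run after the .so has been loaded: they select code paths inside the binary but cannot guard against instructions that -march=armv8.2-a+dotprod+i8mm has already baked in, which would be consistent with the SIGILL in the reply below.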

a-ghorbani (Contributor, Author)

> Just as a quick check: IIRC, many features are now detected at runtime within llama.cpp, using lm_ggml_cpu_has_neon, lm_ggml_cpu_has_dotprod, and lm_ggml_cpu_has_matmul_int8 to check for NEON/i8mm compatibility.
>
> Is it possible to now collapse all builds into rnllama_v8_2_dotprod_i8mm (-march=armv8.2-a+dotprod+i8mm) to cover all ARM SoCs, and see if older devices would still work?

Unfortunately, that won't work:

2025-01-23 15:33:16.443 31626-25330 RNLLAMA_ANDROID_JNI     com.pocketpalai                      I  llama_init_from_model: graph splits = 1
2025-01-23 15:33:16.444 31626-25463 RNLLAMA_LOG_ANDROID     com.pocketpalai                      I  common_init_from_params: setting dry_penalty_last_n to ctx_size = 512
--------- beginning of crash
2025-01-23 15:33:16.444 31626-25463 RNLLAMA_LOG_ANDROID     com.pocketpalai                      W  common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
2025-01-23 15:33:16.471 31626-25464 libc                    com.pocketpalai                      A  Fatal signal 4 (SIGILL), code 1 (ILL_ILLOPC), fault addr 0x6ebdfeeb10 in tid 25464 (AsyncTask #1), pid 31626 (com.pocketpalai)
2025-01-23 15:33:16.612 31626-25318 110                     com.pocketpalai                      I   OptJank - total:108 frameGap:108 delta#0#2#1#0#0#107
2025-01-23 15:33:16.612 31626-25318 111                     com.pocketpalai                      I  OptJank - big and big
2025-01-23 15:33:16.953 31626-25318 110                     com.pocketpalai                      I   OptJank - total:290 frameGap:300 delta#266#13#12#0#2#8
2025-01-23 15:33:16.968 25471-25471 DEBUG                   crash_dump64                         A  pid: 31626, tid: 25464, name: AsyncTask #1  >>> com.pocketpalai <<<
2025-01-23 15:33:16.983 25471-25471 DEBUG                   crash_dump64                         A        #00 pc 00000000000ccb10  /data/app/~~3rSfwEXkaoqDDGLws5rnkA==/com.pocketpalai-eRbNUmyfFjybNlMO-a8uvg==/base.apk!librnllama_v8_2_dotprod_i8mm.so (offset 0x2670000) (BuildId: 9f41aa9d98b500a46cf72895ca9d3c69daf184d1)
2025-01-23 15:33:16.983 25471-25471 DEBUG                   crash_dump64                         A        #01 pc 00000000000aa6b4  /data/app/~~3rSfwEXkaoqDDGLws5rnkA==/com.pocketpalai-eRbNUmyfFjybNlMO-a8uvg==/base.apk!librnllama_v8_2_dotprod_i8mm.so (offset 0x2670000) (BuildId: 9f41aa9d98b500a46cf72895ca9d3c69daf184d1)
2025-01-23 15:33:16.983 25471-25471 DEBUG                   crash_dump64                         A        #02 pc 00000000000ba5f0  /data/app/~~3rSfwEXkaoqDDGLws5rnkA==/com.pocketpalai-eRbNUmyfFjybNlMO-a8uvg==/base.apk!librnllama_v8_2_dotprod_i8mm.so (offset 0x2670000) (BuildId: 9f41aa9d98b500a46cf72895ca9d3c69daf184d1)

jhen0409 (Member) left a comment

Thanks! I really appreciate the testing.

If the new build settings on Android don't break anything, we can use them. I have confirmed the problem doesn't happen on iOS.

jhen0409 merged commit 7e56a2b into mybigday:main on Jan 24, 2025
4 checks passed

Successfully merging this pull request may close these issues.

llama.cpp sync for SVE support for Q4_K_Ms
3 participants