Signature errors #111
Comments
Doing some more work tonight, I finally found out it's directly related to mmap. When creating the upload signature the first time, everything is calculated correctly. For whatever reason there is often some HTTP error; boto catches these errors and automatically retries them, but on the retry the payload hash it calculates comes out wrong. Having no idea what could possibly be the cause, I decided to remove mmap and read the part into a string, in memory. Using tiny 1 MB chunks that's no problem of course, even for 8 sessions. Result: it worked flawlessly. I found two reconnection messages in the log, and there the new payload hash was correct. So that proves to me that the issue lies with mmap, somehow. But how?
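For illustration, a minimal sketch of the workaround described above: reading each part into a plain in-memory byte string instead of handing boto an mmap object. `iter_parts()` is a hypothetical helper, not the actual glacier-cmd code.

```python
# Hypothetical helper, not the actual glacier-cmd code: read each part of the
# archive fully into memory as a plain byte string instead of mmapping it.
def iter_parts(path, part_size=1024 * 1024):
    """Yield (offset, data) tuples, one fully-read part at a time."""
    with open(path, 'rb') as f:
        offset = 0
        while True:
            data = f.read(part_size)
            if not data:
                break
            yield offset, data          # hand `data` (a bytes string) to boto
            offset += len(data)
```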
Nice findings, I don't know why either :) Mmap has to be considered experimental, but people who don't have enough memory to read whole parts still need it, so you have to support both approaches. I will have a new implementation based on your findings.
Cool. I have discovered another thing, not mentioned yet: when calculating the hash, boto has a separate implementation for file-like objects, and that includes our mmapped file. An mmap object can be used both as a string and as a file. So when you give the part as a string object, it uses the standard one-shot hash calculation; for file-like objects it reads and hashes the part in chunks instead. The comments mention that this is to prevent loading the complete file into memory. Which is neat of course, but somehow this is also what I suspect is messing up on retry: at least the result coming from that function is wrong on retry, while the hash I calculated myself over the same part was correct. Yesterday in the end I tried disabling that alternative implementation and having the mmapped part handled directly by the standard string code path.
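A rough sketch of the two hashing paths described above; this is illustrative only, not boto's actual code. The point is that the chunked, file-like path depends on the object's current position, while the string path does not.

```python
import hashlib

# Illustrative only -- not boto's actual code.  A string payload can be hashed
# in one shot; a file-like payload (such as an mmap object) is read and hashed
# in chunks.  If the file-like object's position is not at the start of the
# part, the chunked path silently hashes the wrong bytes.

def hash_string(data):
    return hashlib.sha256(data).hexdigest()

def hash_filelike(fp, chunk_size=8192):
    h = hashlib.sha256()
    # Note: no fp.seek(0) here -- this mirrors the suspected failure mode.
    for chunk in iter(lambda: fp.read(chunk_size), b''):
        h.update(chunk)
    return h.hexdigest()
```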
Got it. The issue: we're giving boto an mmap object, while boto is not written with that in mind. It fully expects a string, and while an mmap can of course be addressed as a regular string, it also has file-like capabilities. When the part is uploaded the first time around, the seek pointer is at the beginning of the part, position 0. When the part is retried, the seek pointer is at anything but 0 (I haven't figured out how that happens, just that it happens; it seems part of the part is read, then the HTTP connection fails, and as a result the pointer is somewhere between 0 and part_size). Of course this messes up the calculation of the hash in boto's file-like code path, which only hashes from the current position to the end. This will have to be fixed in boto: the payload has to be rewound to position 0 before the hash is recalculated for the retry.
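A minimal sketch of the rewind-before-hashing idea, with a hypothetical `safe_payload_hash()` helper; this is not boto's actual patch, just the shape of the fix.

```python
import hashlib

# Hypothetical helper sketching the idea, not boto's actual patch: rewind any
# file-like payload to position 0 before (re)computing its hash, so a retry
# hashes the whole part instead of whatever is left past the old seek pointer.
def safe_payload_hash(payload, chunk_size=8192):
    if hasattr(payload, 'seek') and hasattr(payload, 'read'):
        payload.seek(0)                       # rewind after a failed attempt
        h = hashlib.sha256()
        for chunk in iter(lambda: payload.read(chunk_size), b''):
            h.update(chunk)
        payload.seek(0)                       # leave it ready to be re-sent
        return h.hexdigest()
    return hashlib.sha256(payload).hexdigest()
```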
We should certainly continue to use mmap objects, as otherwise each part has to be read completely into memory. Also, for calculating the hashes we should really consider calling the boto function for that, as it hashes the part in chunks instead of pulling it all into memory at once. But why waste memory when we can do without? If the memory is there, the OS will cache the file in memory for speedy access. If it's not there, it all takes longer as the file has to be read from disk time and again, but at least it still works. So next steps: fork boto, create the fix, wait for it to be accepted.
One word: Awesome! :D Great work tracking this one down!
Well... I was too fast there... Now I don't get these response errors all the time (at least the hash errors are far less frequent), but they still happen. Mmap is supposed to be thread-safe, so reading different parts from the same file in several sessions should not be a problem in itself. So back to the drawing board... Damn! A possible solution could be to not mmap several parts of the same file through the same file descriptor, but to give each part its own mapping. Also boto just reports a signature error, which doesn't give much to go on.
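For reference, a hedged sketch of what "its own mapping per part" could look like; `map_part()` is a hypothetical helper, and the alignment to `mmap.ALLOCATIONGRANULARITY` is a requirement of Python's mmap module rather than anything from glacier-cmd. The mapping stays valid after the file object is closed, so each session only holds its own mmap.

```python
import mmap
import os

# Hypothetical helper, not glacier-cmd code: create a private mapping per part.
# mmap offsets must be multiples of mmap.ALLOCATIONGRANULARITY, so the window
# is aligned down and the caller slices from `delta`.
def map_part(path, offset, length):
    gran = mmap.ALLOCATIONGRANULARITY
    aligned = (offset // gran) * gran
    delta = offset - aligned
    window = min(delta + length, os.path.getsize(path) - aligned)
    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), window, access=mmap.ACCESS_READ,
                       offset=aligned)
    return mm, delta    # the part's bytes are mm[delta:delta + length]
```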
The errors are signature errors again. Somehow the reading of the mmapped file doesn't work as expected, hence the suggestion of creating tmp files. I created some minor fixes for boto and sent them a pull request; I hope they will be accepted and incorporated soon. It will also mean that we'll have to start using boto's development branch again, at least until the next release.
Wow, I never expected there would be such problems with mmapping in Python, but I think mmapping should still be considered experimental. At the same time it would be awesome if you could find out what causes the problems.
What you need to do is open a new file for every process you start with multiprocessing; then it must work! This is the only correct way to implement this and also the way I did it; I will report my findings when the code is complete. Don't use the same file descriptor for multiple mmaps. Glacier does not need threads/processes that depend on each other (that only causes problems); take a look at my code to see how it's supposed to be done.
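A minimal sketch of that approach, with hypothetical `upload_one()`/`upload_parallel()` names: every worker process opens the archive itself, so no file descriptor or mmap is shared between processes. The actual upload call is left out.

```python
from multiprocessing import Pool

# Sketch of the per-process approach, with hypothetical names: every worker
# opens the archive itself, so no file descriptor or mmap is shared between
# processes.  The actual upload call is left out.
def upload_one(job):
    path, offset, length = job
    with open(path, 'rb') as f:          # fresh file descriptor per process
        f.seek(offset)
        data = f.read(length)
    # ... hand `data` to the uploader here ...
    return offset, len(data)

def upload_parallel(path, total_size, part_size, processes=4):
    jobs = [(path, off, part_size) for off in range(0, total_size, part_size)]
    with Pool(processes) as pool:        # guard with `if __name__ == '__main__'`
        return pool.map(upload_one, jobs)
```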
This is my first ever foray into multiprocessing, so I don't know very well how to handle files and so on; it's a bit of trial and error, and then if it breaks, knowing where to look! I will try to change this and see what happens. The issues that I found appear to be not mmap errors as such, just things being handled wrongly. Boto is clearly designed with just strings in mind, so when you start feeding it file-like stuff things may very well break, like the seek pointer that's not at the start of the part after a failed upload attempt.

Then I also looked at your code. Interesting work, very different from my approach. Lots of error checking, I noticed. The messages will need to be improved though: too much developer language, too hard for a user to understand what's actually going on and what they did wrong. A user has no idea what a GlacierWrapper is, for example.

And for retrying: Boto retries a lot already. The two errors that I ran into (broken pipes and connection resets) were caught by Boto and retried. The only retryable error that I have seen coming back to our script is a ResponseException due to a connection time-out. Everything else that can be retried is retried already. You just don't see much of that in the logs, as I was forced to switch off debug logging for Boto because it logs the complete payload...
Hi,

Well, you can handle errors, check error codes and re-raise them if needed. GlacierWrapper is only the library; it must not do any output to the console. As far as I have seen of boto, they haven't thought too much when implementing the Glacier support. There's also the option that we fork boto and reimplement the Glacier calls the way we need them. And logging.exception does exist.
Should not exist, of course, but in their defence it's a rather unusual corner case. On the other hand it's surprising no-one else ran into it, as it seems like something more people would hit.
They probably didn't plan much, as it was written in the few weeks after Glacier was released.
Well, actually you don't need that much time; forking isn't that bad. First of all: take the original code from glaciercorecalls, which already talks to the API fairly directly. The only thing that we really don't have is the authentication bit. And the part where they create a pool of 1,024 HTTP connections we can probably do without. Reading their code is hard, really hard.
Ah right, I was looking at the level names only. This one is an extra on top of those.
Just a simple question, and not a technical one: I am getting these errors regularly, but nothing stops or breaks and eventually the file is uploaded. But is the file complete at Amazon? Retrieving the inventory gives me the same checksum as I have locally, but is that the checksum I gave when uploading, or did Amazon calculate it itself when the file was completely uploaded? The reason for my worry is this, from the log:
Or to shorten it:
As you can see, the log never mentions finishing the upload of part 2550136832-2684354559. Was it uploaded or not?
This issue stems from #91; I think it's better to start a new thread.
In short: when attempting to do parallel uploads, the threads die one by one with a SignatureError. I'm trying to figure out where this problem originates, but haven't gotten far yet. It's getting late, so I'm calling it a night; here is the status so far. Maybe someone can pick it up from here.
I have filtered bits and pieces of a test upload with eight sessions (guaranteed to trigger the issue), and here are what I think are the important snippets from my log.
Note: this uses a slightly changed version of boto that does not log the payload (the 1 MB data chunks) to the log file, and in glacier-cmd I have enabled boto's debug logging, which is normally switched off because of that payload logging... (for details see the discussion at #90 (comment)).
SHA256 hash of a part of a test archive: d9eee35c8b9a29e1590802e04948db300a2c601ce46f7a07797f1a5d6a0166b1
At 22:43:12 the request is created correctly:
The SHA256 hash and the tree hash are identical because this is a single 1 MB chunk. For larger parts they will differ, and the hash at the end of the request should be the hash of the complete payload (not the tree hash).
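For reference, a short sketch of the standard Glacier tree-hash construction showing why the two values coincide for a single 1 MB chunk; this is a generic implementation, not glacier-cmd's exact code.

```python
import hashlib

MB = 1024 * 1024

# Generic Glacier-style tree hash (SHA-256 leaves over 1 MB chunks, combined
# pairwise); not glacier-cmd's exact implementation.  For a part of 1 MB or
# less there is a single leaf, so the tree hash equals the plain SHA-256.
def tree_hash(data):
    leaves = [hashlib.sha256(data[i:i + MB]).digest()
              for i in range(0, len(data), MB)] or [hashlib.sha256(b'').digest()]
    while len(leaves) > 1:
        paired = [hashlib.sha256(leaves[i] + leaves[i + 1]).digest()
                  for i in range(0, len(leaves) - 1, 2)]
        if len(leaves) % 2:
            paired.append(leaves[-1])
        leaves = paired
    return leaves[0].hex()

# For a part of 1 MB or smaller:
# tree_hash(part) == hashlib.sha256(part).hexdigest()
```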
At 22:43:23 an unspecified error is encountered; boto correctly retries this one, and creates a new request. This request however uses a wrong hash for the payload (or maybe the whole payload is wrong this time?).
Not very surprisingly, this results in an error a little later:
The key issue is of course: why is the payload hash different from the tree hash and the SHA256 hash when regenerating the request?
Note: it is not simply the hash of another block of the file, I checked that. It's a totally new one.
I expect this payload hash is recalculated by boto; it has to be calculated for all other requests too (e.g. create vault), so most likely they simply compute it fresh for every request.
If so, the next question: why has the payload changed? This is odd, of course.
This is either a bug in Boto (losing part or all of the payload when retrying? The threads should be fully separated, though) or an issue related to mmap where the mapping is somehow inconsistent (very unlikely).
Yet Amazon's reply suggests that the data itself is actually sent correctly. They calculate the hashes themselves too, and the hash values returned by Amazon match the local values. I assume at least that these are hashes Amazon calculated from the received data.