Lock wait timeout exceeded - not a bug, but rather sql server setting / slow query • diablo2.io

7

Elektroputzi `77`

6 months ago (pre-Resurrected)

Device: PC
Browser: Firefox 146.0.1
OS: Windows 11 Pro 10.0.26100

Steps to reproduce:
- make a donation
- reload page while logged in

Anything else to add:

Not really a bug and not sure, if it really is worth investigating. Never had the error before under any other circumstances and might be specific to an UPDATE query issued when making a donation as a logged in user. Query probably ran at 12:55 CET. Might find it in the logs, if slow query log is enabled.

Solution might be to:
- optimize the specific query
- set indexed on relevant tables, if not already done
- increase innodb_lock_wait_timeout in my.cnf

No biggie, feel free to close the thread, if not relevant.

Fixedby Teebling • 6 months ago•Go to post

This is fixed now by the way, but to answer some of your questions, as this may be helpful to future readers as well:

Elektroputzi wrote: 6 months ago
we are talking pure file cache

Yep, just a pure file cache with opcache sitting on top to compile bytecode - there is no object cache in memory like redis.

Elektroputzi wrote: 6 months ago
[*] if a user is assigned to a new group, something is done with the file cache. (my understanding was it first is purged, then rebuild)

Yep, phpBB will attempt to scan and then purge the file cache in its entirety for a variety of reasons - in your specific case it was because your user colour changed when you are added to the Donators group. The cache always starts rebuilding as soon as the old one is purged.

Elektroputzi wrote: 6 months ago
[*] the cache operation is taking very long, so that the program stalls too long and the db query runs into the configured timeout -> the user is getting the error, when trying to load the page, while still logged in and the process is still running (prolly due to the update command locking the table, the select query cannot fetch required data)

The slow down was not the building of the new cache, it is the scanning of/deleting of the old one (which at the time had hundreds of thousands of files in it) that was locking the table for too long, leading to a timeout error. That's why you received a lock_wait_timeout_exceeded error.

Elektroputzi wrote: 6 months ago
[*] whatever is done to the cache, is it done to the whole cache, or per user?

phpBB attempts to purge the entire file cache - it does not balkanise the cache on a per-user or per-table basis.

Elektroputzi wrote: 6 months ago
The idea was to replace the php function with something that renames the old folder, creates a new folder and use that instead, trigger the deletion of the old folder, but not wait for it to be done.

Yeah something like this way you suggested would have worked way better than the way phpBB does it natively:

Code: Select all

mv /cache/production cache/productiono
nice -19 rm -rf /cache/productiono

So that would have been a good fix actually, preventing the table from locking and allowing the transaction to finish anyway. I ended up going with a different approach though.

First I had to grep through my typical file cache to find patterns of which queries were being cached so often and so many times. Then once I had found a number of culprit queries that looked similar to each other, I had to figure out which files were issuing these queries and asking them to be cached. Then I refactored a bunch of the old database code from 2023 that was responsible to make it 95% more performant than it was before, and therefore not requiring so much cacheing. This was also a classic N+1 exponential growth problem where the way in which I was cacheing the queries was leading to unique cache files per user per item per runes within item per this per that etc. etc. which was leading to the file cache filling up with an extraordinarily large number of files and very quickly.

After rinsing and repeating this process a few times, and ensuring cacheing remained in a non-proliferous way for the heaviest queries affecting page load speed, I was able to get the cache back to a normal looking state - where it stays around 500-1000 files in total at any time - which is what it's supposed to look like for phpBB - a number that phpBB's native cache purge behaviour has no issues dealing with - meaning no more timeout wait times being exceeded! This underlying issue fix in turn also fixes a bunch of other problems that were happening with timeouts elsewhere in the experience.

Thanks for your help and your report Elektro! Marking resolved.

7

Teebling `9555`Admin

6 months ago (pre-Resurrected)

Hey Elektro,

Thanks for your report (and your donation!).

What is actually happening is that the site has too many cache files, and as part of the transaction when you donate, it tries to purge that cache after adding you to the Donators user group, but it takes too long (because there are too many cache files) and therefore the transaction times out as per the lock_wait_timeout setting.

This is part of a larger bug that I am actively investigating (to find the root source of the cache proliferation issue). The tables are indexed and my.cnf is configured correctly. Any other ideas?

Donate diablo2.io administrator • Development Log • Site Tips & Tricks

7

OP

Elektroputzi `77`

6 months ago (pre-Resurrected)

Good question. What kind of cache are we talking about?

Maybe as a dirty workaround first rename the cache folder and create a new one, then delete the old cache folder. Renaming will be way faster than deleting recursively and rm can run on the old folder as long as it needs, since it is not in use anymore.

Something like:
mv /path/to/cache /path/to/cache_old && mkdir /path/to/cache
rm -rf /path/to/cache_old

7

Teebling `9555`Admin

6 months ago (pre-Resurrected)

Cache is a file cache, not redis or anything like that. Purged regularly already but thanks for the suggestion. The issue is not the ability to purge the cache, it is that the cache fills up too quickly

Donate diablo2.io administrator • Development Log • Site Tips & Tricks

7

Teebling `9555`Admin

6 months ago (pre-Resurrected)

Anyway, appreciate the help - this is a known issue and I'm working on addressing it.

Donate diablo2.io administrator • Development Log • Site Tips & Tricks

7

OP

Elektroputzi `77`

6 months ago (pre-Resurrected)

Alright, I understand. So it is not the purge, but the cache build up taking so long.

Guess there are 2 ways to address this.

change the program logic to not use cache files, as long as the build up is not finished
increase cache build up speed, by either:
- using CPU with significantly higher frequency (assuming php is used)
- using a faster caching algorithm or compression algorithm, or less efficient compression, if compression is a thing. will ofc lead to more disk space and more bandwidth being used

Maybe optimizing opcache can do a little, but I guess rather not.

However, happy hunting! glhf!

7

Teebling `9555`Admin

6 months ago (pre-Resurrected)

Elektroputzi wrote: 6 months ago
So it is not the purge, but the cache build up taking so long.

No, that's not it - it's that the cache is building up with tens of thousands of files too quickly!

Which then makes it hard for phpBB to clear them when it wants to (even with nice -19) I'm locating the queries that do this now.

Donate diablo2.io administrator • Development Log • Site Tips & Tricks

7

OP

Elektroputzi `77`

6 months ago (pre-Resurrected)

Ok, clearly I do NOT understand

I am a little confused here.

So this is what I think I understood correctly:
[*] we are talking pure file cache, no object cache involved (redis, memecached)
[*] if a user is assigned to a new group, something is done with the file cache. (my understanding was it first is purged, then rebuild)
[*] the cache operation is taking very long, so that the program stalls too long and the db query runs into the configured timeout -> the user is getting the error, when trying to load the page, while still logged in and the process is still running (prolly due to the update command locking the table, the select query cannot fetch required data)

My questions now would be:
[*] whatever is done to the cache, is it done to the whole cache, or per user?
[*] do you know the exact php funtion triggering the cache operation?
[*] can the code be customized or is it obfuscated. or is the risk too high to break something or to overwrite the changes with the next patch? (I am not familiar with the forum software)

Just to be clear, with my first idea I did not mean to purge the whole cache once or on a regular basis per cronjob or something like this. The idea was to replace the php function with something that renames the old folder, creates a new folder and use that instead, trigger the deletion of the old folder, but not wait for it to be done.

7

Teebling `9555`Admin

6 months ago (pre-Resurrected)

This is fixed now by the way, but to answer some of your questions, as this may be helpful to future readers as well:

Elektroputzi wrote: 6 months ago
we are talking pure file cache

Yep, just a pure file cache with opcache sitting on top to compile bytecode - there is no object cache in Memory like redis.

Elektroputzi wrote: 6 months ago
[*] if a user is assigned to a new group, something is done with the file cache. (my understanding was it first is purged, then rebuild)

Yep, phpBB will attempt to scan and then purge the file cache in its entirety for a variety of reasons - in your specific case it was because your user colour changed when you are added to the Donators group. The cache always starts rebuilding as soon as the old one is purged.

Elektroputzi wrote: 6 months ago
[*] the cache operation is taking very long, so that the program stalls too long and the db query runs into the configured timeout -> the user is getting the error, when trying to load the page, while still logged in and the process is still running (prolly due to the update command locking the table, the select query cannot fetch required data)

The slow down was not the building of the new cache, it is the scanning of/deleting of the old one (which at the time had hundreds of thousands of files in it) that was locking the table for too long, leading to a timeout error. That's why you received a lock_wait_timeout_exceeded error.

Elektroputzi wrote: 6 months ago
[*] whatever is done to the cache, is it done to the whole cache, or per user?

phpBB attempts to purge the entire file cache - it does not balkanise the cache on a per-user or per-table basis.

Elektroputzi wrote: 6 months ago
The idea was to replace the php function with something that renames the old folder, creates a new folder and use that instead, trigger the deletion of the old folder, but not wait for it to be done.

Yeah something like this way you suggested would have worked way better than the way phpBB does it natively:

Code: Select all

mv /cache/production cache/productiono
nice -19 rm -rf /cache/productiono

So that would have been a good fix actually, preventing the table from locking and allowing the transaction to finish anyway. I ended up going with a different approach though.

First I had to grep through my typical file cache to find patterns of which queries were being cached so often and so many times. Then once I had found a number of culprit queries that looked similar to each other, I had to figure out which files were issuing these queries and asking them to be cached. Then I refactored a bunch of the old database code from 2023 that was responsible to make it 95% more performant than it was before, and therefore not requiring so much cacheing. This was also a classic N+1 exponential growth problem where the way in which I was cacheing the queries was leading to unique cache files per user per item per runes within item per this per that etc. etc. which was leading to the file cache filling up with an extraordinarily large number of files and very quickly.

After rinsing and repeating this process a few times, and ensuring cacheing remained in a non-proliferous way for the heaviest queries affecting page load speed, I was able to get the cache back to a normal looking state - where it stays around 500-1000 files in total at any time - which is what it's supposed to look like for phpBB - a number that phpBB's native cache purge behaviour has no issues dealing with - meaning no more timeout wait times being exceeded! This underlying issue fix in turn also fixes a bunch of other problems that were happening with timeouts elsewhere in the experience.

Thanks for your help and your report Elektro! Marking resolved.

This post was marked as the fix.

Donate diablo2.io administrator • Development Log • Site Tips & Tricks

7

OP

Elektroputzi `77`

6 months ago (pre-Resurrected)

Teebling wrote: 6 months ago
...
After rinsing and repeating this process a few times, and ensuring cacheing remained in a non-proliferous way for the heaviest queries affecting page load speed, I was able to get the cache back to a normal looking state - where it stays around 500-1000 files in total at any time - which is what it's supposed to look like for phpBB - a number that phpBB's native cache purge behaviour has no issues dealing with - meaning no more timeout wait times being exceeded! This underlying issue fix in turn also fixes a bunch of other problems that were happening with timeouts elsewhere in the experience.
...

Very nice, as it resolves the root issue, bloated cache, instead of tempering with the code

Thank you for sharing and fixing!

7

Necrarch `3785`Moderator

6 months ago (pre-Resurrected)

Great work, thanks guys.

Can do trades on D2R or D2RotW
If you pay the requested price please comment and send me a friend request on bnet
Main: Necromancer / Second: Assassin / Third: Amazon / Check my stash tabs

Lock wait timeout exceeded - not a bug, but rather sql server setting / slow query