This topic is now outdated. Please read the patch notes for v1.0.

Hi everyone!
Just wanted to do a quick write-up of what the (fuck) has been going on with the site's loading times, the endless CAPTCHA challenges, 503 timeouts etc. that you have all had to suffer pretty much since launch. Sorry if I haven't been that communicative recently, or been short with some of you.
As far as I am aware from my own testing, and from reports from members here, the server now seems to be coping with the amount of traffic at this peak evening time. It's still not lightning fast, but it is at least serving requests to people now, and delays are decreasing over time. I'm not really sure what happened, or in which order things happened precisely - it has been a very confusing last week for me, during which I've spent almost every day talking to my hosting providers and collaborating with various d2io members here.
On launch day there was quite a crowd on here, about 500 or so concurrent, and as soon as the clock struck the hour most people logged off to go and play the game. I was very happy with the demands made on the machine at that point. However, completely unexpectedly, the night before Google had changed their algorithm and placed d2io in the top five results for some very popular keywords like 'diablo 2 resurrected runewords', 'diablo 2 trade', and, because of the AVX issues, 'diablo 2 won't launch'. This had a pretty serious effect once people began playing and searching the internet - the amount of traffic coming in was just nuts. Basically the site went from about 100,000 requests in 24h to over 6 million requests in 24h.
The site began slowing down pretty significantly, and I got talking to the hosts about it after some initial troubleshooting of my own failed to solve the issues. I was shown my Apache access.log and there were some specific IPs absolutely hammering the site - we were in agreement (hosts and I) that this was malicious, so I was advised to configure Cloudflare as a measure to stop the DDoS with a CAPTCHA challenge and the traffic absorption CF offers.
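For anyone curious what 'hammering' looks like in a log, here's a minimal sketch of that kind of check - just counting requests per client IP in the Apache access log and printing the worst offenders. This isn't the exact tooling the hosts used, and the log path is an assumption on my part; a one-liner over access.log does the same job.

```python
from collections import Counter

# Count requests per client IP in an Apache access log and print the heaviest
# hitters. Assumes the common/combined log format where the remote IP is the
# first whitespace-separated field; the log path below is an assumption.

def top_ips(path="/var/log/apache2/access.log", n=20):
    counts = Counter()
    with open(path, errors="replace") as f:
        for line in f:
            ip = line.split(" ", 1)[0]
            counts[ip] += 1
    for ip, hits in counts.most_common(n):
        print(f"{hits:8d}  {ip}")

if __name__ == "__main__":
    top_ips()
```

When a handful of addresses account for a huge share of millions of requests, that's the smoking gun for malicious traffic rather than organic growth.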
Things didn't really improve though, and the longer I talked to the hosts about it, the more I realised that actually there was just a metric f tonne of people on the site: trades being listed seconds after one another, people registering, posting, commenting, autolinks spawning everywhere, tooltips being generated, just a whole load of requests. At this stage the hosts recommended I upgrade the RAM on my VPS instance, saying it might alleviate the resource pressure on the instance.
This didn't work either, and I was also having to deal with bugs introduced by the CNAME method of routing through Cloudflare, which meant adding www. to the domain name, which yeah, just caused a lot of issues elsewhere. In the meantime I was trying to debug what was going on, but I couldn't even do that because the server was running so slowly under the traffic. So it was kind of a nightmare scenario - thousands of users visiting and joining but unable to enjoy the site due to the slowness, and with each passing hour it became more and more difficult to fix things because I couldn't test anything.
I ordered the dedicated server at this point on the recommendation of the hosts - thank you to everyone who has donated btw - it was supposed to arrive after one day, but instead it took three days; the hosts said they didn't have enough machines available when I ordered, hence the delay. When it arrived and was set up there were still a few hurdles to jump, as the migration didn't go perfectly, which was to be expected.
Performance, however, didn't noticeably improve after the setup was finished. There were panicked calls with the hosts and in the Discord chat, and a couple of people came out of the woodwork to offer help. During this time I had some amazing assistance from @Noemard and @Byte-Size, who helped me talk to my hosts, identify malicious attacks, identify performance issues, and gather a lot of data about slow queries, unoptimised tables, missing indexes, and many more performance-related issues on the back end - basically they managed to get a grip on the situation and find quick solutions. You guys literally saved this site from going dark, so thanks.
We found one or two rogue queries that had gone mad and were taking so long that they crashed the MySQL server, causing restarts, causing the CPU to spike, causing people's requests to time out, and just yeah, causing pandemonium. These high consumers were cut out, and things began picking up this morning afterwards. Clearly the backend code I wrote was not suited to this kind of explosive growth - I think the buzzword these days is 'scaling' - and it just melted as both the organic and malicious traffic kept increasing.
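For the technically minded, the way the rogue queries stood out was by looking at what the slow query log says costs the most in total. Below is a rough Python sketch of that idea, not the actual scripts that were used (Percona's pt-query-digest does this properly) - it groups entries by query "shape" and ranks them by total time. The file name slow.log and the exact log layout are assumptions.

```python
import re
from collections import defaultdict

# Group MySQL's slow query log by query "shape" (literals stripped out) and rank
# shapes by total time, so the one or two rogue statements stand out. Assumes
# the default slow-log format with "# Query_time:" header lines.

HEADER = re.compile(r"# Query_time: ([\d.]+).*Rows_examined: (\d+)")

def normalise(sql):
    """Collapse whitespace and literals so repeated queries group together."""
    sql = re.sub(r"\s+", " ", sql.strip())
    sql = re.sub(r"'[^']*'", "'?'", sql)   # string literals
    sql = re.sub(r"\b\d+\b", "?", sql)     # numeric literals
    return sql[:120]

def digest(path="slow.log", top=10):
    totals = defaultdict(lambda: [0.0, 0, 0])  # shape -> [seconds, calls, rows examined]
    query_time, rows_examined, statement = 0.0, 0, []
    with open(path, errors="replace") as f:
        for line in f:
            m = HEADER.match(line)
            if m:
                query_time, rows_examined = float(m.group(1)), int(m.group(2))
            elif line.startswith("#") or line.startswith("SET timestamp") or line.startswith("use "):
                continue
            else:
                statement.append(line)
                if line.rstrip().endswith(";"):
                    shape = normalise("".join(statement))
                    t = totals[shape]
                    t[0] += query_time
                    t[1] += 1
                    t[2] += rows_examined
                    statement = []
    ranked = sorted(totals.items(), key=lambda kv: kv[1][0], reverse=True)
    for shape, (secs, calls, rows) in ranked[:top]:
        print(f"{secs:9.1f}s  {calls:6d} calls  {rows:12d} rows examined  {shape}")

if __name__ == "__main__":
    digest()
```

A query shape that shows a huge 'rows examined' relative to what it returns is usually the missing-index / unoptimised-table situation described above.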
So yeah, some surgery has been done and some very unoptimised parts of the site have been removed temporarily: autolinks aren't currently prettified or tooltipped when used in a post/comment/trade context, and notifications for trades being sold are currently disabled. The online user count now only represents the last 1 minute of activity, whereas before it was the last 10 minutes. Work is continuing to optimise these functions so that they can return without causing the site to lock up again. You'll have to do without some luxuries for now, sorry.
The malicious traffic has abated too - Cloudflare logs show that since the start it has received 42 million requests for d2io, of which 12 million were malicious attacks. This number has decreased now because, in addition to Cloudflare, I was able to determine the offending IPs and firewall them. The site is no longer considered to be 'under attack', so no CAPTCHAs are issued. The site remains connected to Cloudflare, which operates as a CDN, and on top of that, if a DDoS starts up again I can take steps much more promptly now to defend.
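The 'firewall them' part is nothing clever. Roughly this kind of thing, as a sketch (the blocked_ips.txt file name and the choice of iptables over ufw/nftables are assumptions on my part): take the list of offending addresses and turn it into DROP rules you can review before applying.

```python
import ipaddress

# Turn a list of offending addresses (one per line, single IPs or CIDR ranges)
# into iptables DROP rules. Printing the commands rather than running them
# makes them easy to review first. blocked_ips.txt is an assumed file name.

def emit_rules(path="blocked_ips.txt"):
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            net = ipaddress.ip_network(line, strict=False)  # validates the address
            tool = "ip6tables" if net.version == 6 else "iptables"
            print(f"{tool} -A INPUT -s {net} -j DROP")

if __name__ == "__main__":
    emit_rules()
```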
To conclude, this was mainly my fault for not having designed with scaling in mind or prepared a contingency for a situation like this. It wasn't clear to me what was happening a lot of the time, in part due to my inexperience with this kind of stuff and in part due to just being unlucky with the timings of things. A lot of different things went wrong at the same time and it was definitely overwhelming.
So I'm sorry for that - I've learned a lot of stuff recently. I think the worst of it is over now. Optimisation is continuing all the time, I'm continuing to monitor the server, and I'll be reintroducing those cut-out features when they're ready. I know that the speed isn't quite there yet, but it's getting there now.
Teebling