On February 4, 2022, at around 6:30-7 pm (EST), TinglyTube went down and stayed down for approximately 10 hours, until it was brought back up around 5 am today (February 5, 2022). So how and why did this happen?
Well, while TinglyTube has a small but incredible development (and engineering) team, they are mostly off on weekends unless specifically called in. That leaves me (JJ15ASMR, Co-Founder, CEO, & CTO of TinglyTube) as the "first responder" to any issues that arise on the backend of the site before our development team is called in to assist.
There has usually been no issue with it being that way, but it relies on me being awake for most of the day and constantly checking in on TinglyTube, as I usually do. However, my second semester of high school started this week, which has required me to start waking up much earlier than I did during break (I go to online school, which makes it better though :)). Because of this, I've been getting quite tired in the evenings unless I force myself to stay awake until a more reasonable time to fall asleep. Since it was a Friday with nothing else I needed to do, I decided to lie down for a minute, but I didn't set an alarm, and that turned into a full 10 hours of sleep... all while TinglyTube went down shortly after.
And remember what I said about needing to constantly check in on TinglyTube? Well, that's a very important part of the equation, because at the time of the outage we didn't have any server monitoring system in place to automatically check and notify us if something was down, not responding, or not working. One should have been in place, and it's my fault that it wasn't. The closest thing we have in place is Munin, a simple open-source resource and network monitor. It doesn't alert us, but it does track resource and network usage and create graphs of it, which is how we have this graph showing our shameful downtime gap:
As said before, this is completely my fault: TinglyTube is my passion, and I am responsible for making the big decisions regarding it, which include how our infrastructure is set up and monitored. I am extremely sorry to the TinglyTube community and our team for this horrible mistake of not having a monitoring/alert system in place.
Obviously, an apology is not the only thing needed to make this right; something also needs to be done to prevent this in the future. So a proper monitoring and alert system was set up this morning that will notify me and our development team in a variety of ways (email, SMS, & phone) in case of a future outage or other issue(s), so that they can be fixed in minutes, not hours.
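For anyone curious what this kind of check looks like in principle, here is a minimal sketch in Python. It is only an illustration and not the system we actually set up: the URL, the threshold of three failed checks, and the `notify` callback (which stands in for email, SMS, or phone alerts) are all assumptions made up for this example.

```python
import http.client
import urllib.request

# Hypothetical values for illustration only.
SITE_URL = "https://example.com/"
FAILURES_BEFORE_ALERT = 3


def site_is_up(url=SITE_URL, timeout=10.0):
    """Return True if the URL answers successfully within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # Covers DNS failures, refused connections, timeouts,
        # and HTTP error statuses (4xx/5xx raise HTTPError).
        return False


class OutageMonitor:
    """Alert once after several failed checks in a row, and once on recovery."""

    def __init__(self, check, notify, threshold=FAILURES_BEFORE_ALERT):
        self.check = check          # callable returning True when the site is up
        self.notify = notify        # callable taking a message string
        self.threshold = threshold  # consecutive failures before alerting
        self.failures = 0
        self.alerted = False

    def poll(self):
        """Run one check; return True if the site was up."""
        if self.check():
            if self.alerted:
                self.notify("recovered")
            self.failures = 0
            self.alerted = False
            return True
        self.failures += 1
        if self.failures >= self.threshold and not self.alerted:
            self.notify(f"down after {self.failures} failed checks")
            self.alerted = True
        return False
```

Something like `OutageMonitor(site_is_up, print)` with `poll()` called once a minute from a loop or a cron job would be enough to try it out. Requiring a few failures in a row before alerting avoids waking everyone up over a single dropped request, and a check like this is best run from a machine other than the server it is watching.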
As for "why" this happened, we're not entirely sure yet. There are so many moving parts that play a role in making TinglyTube work, and any of them could have malfunctioned or possibly been overloaded. The biggest priority was getting things back online, which was actually quite simple in a way: all that was needed was a forced restart of our server, and things were back up and running within about 20 minutes once everything had started back up again. We are still investigating to find the root cause, though, because if it's something we can fix and prevent in the future, we will.
If you have any questions or concerns about the downtime/outage, please contact us via our Support/Feedback form and we'll get back to you as soon as possible.
- Jacob Daniel (JJ15ASMR) Co-Founder, CEO, & CTO.