Describe what you did during the internship in 3 sentences?

During my internship, I improved the database reliability and reduced operational burden for DB teams at Slack! I migrated a table from the legacy database to a more reliable database system, which I finished 1 month earlier than expected. And fun fact: this internship has changed me from a shy person to a motivated communicator!

Tell me about your project?

I migrated a table from an old Master-Mater database to a more reliable database system called “Vitess”. Having key tables on this more robust system enables the company to support more data with flexible scaling and less fail-over.

It was a 4-month project but I finished 1 month earlier! This table I migrated was the last table on that old database, so I proceeded to clean up other dependencies to deprovision the old database! This saved great costs for maintaining the fragile old database.

How I did that?

Planning! I designed a technical spec including why to do the project, major components, how to perform it in steps, risk analysis and management, trade-off (like which “keyspace” to choose) to consider. I also made a plan with tasks broken down for each week. I presented the project plan to teammates and collect feedback from them.

Collected insights! I analyzed on Data Warehouse for: QPS (avg and peak), traffic types, what are the most frequent queries. I also searched for all the usages in the codebase. With insights collected, I made decisions on schema and other configurations.

Encapsulated! There were some really old SQL call sites that were exposed in application code! Not a good practice, right? I built a DB access layer to encapsulate all SQL calls to the table. Backend engineers can just interact with this library to access the database without knowing any details about changes. 

Moved data from the old database to the new system! I did this by enabling dual read/write in various modes: at first, all calls would be delegated to the old database, and then dual write, and then dual read/write with the source of truth changed, all the way to only reading from the new system. For every single mode change, I first did that on the development environment, then on production after verifying things going well; I compared data to validate, and monitored traffic with graphs and Data Warehouse.

Table migrated! Because I finished the project assigned to me early, I continued to clean up the dependencies to deprovision the old database. I removed the code references after careful examination; I collaborated with other teams to stop sourcing data from the old database.

What is the most difficult problem you have faced and how you solved it?

I was totally new to database and infrastructure! Before going into this project, I was really worried about uncertainty. How to do this and that? What could be the edge cases? How long would it take?

One wired edge case I encountered: After getting into the double write mode on dev, I ran the comparing job to validate the records on both the legacy database and the new system were the same as expected. However, it was not! The number of “records not in the new place” was non-zero and kept increasing. Wow, how to face that?

1. I estimated the situation with the team and decided whether to roll back. In this case, since it was on dev and only “double write”, it was safe to keep it there to monitor.

2. I came up with possible causes: mystery branches not on the latest master? Bugs in code? Some environments with different settings?

3. I investigated by visualizing the difference, checking the code, searing in db console, logging errors on logstash. I checked where those different records were used. 

4. Oh, I found it! QA environment had different configurations! A QA specific issue! 

5. Discussed with teammates to validate the QA issue; timeboxed the investigation for 2 days, and then made the change on production. Oh, zero diff now!

So I moved forward by unblocking myself with investigating like a detective!

How did you communicate?

I wrote a clear technical spec to describe the project and organized a meeting to present it and collect feedback.

I summarized my progress on the project, blockers and plans, and shared it with my team at the end of every week. I initiated to do this so that I could save my teammates time and also provide a clear picture of how the project went.

I was not afraid of asking questions! However, I made sure every time I asked a question, it was with courtesy to others. I described my question, what the motivation and situation was, and what I had tried. I asked questions on Slack or in person. For the latter, I wrote down summaries before I asked, and kept notes for questions.

I explicitly asked for critical feedback from my mentor and manager on a regular basis. I was eager to improve myself and contribute more to the team.  

What are you looking for in the next internship?

I can be more professional! Go in deep and solid, for whatever I will be doing. By the end of the internship, I would be able to optimize the system I am working on, and give a deep tech talk on that! 

Great mentorship! Teammates are knowledgeable while keen on continuously learning, and they are willing to spend time guiding new members on how to onboard and how to learn. 

AMA For My Internship At Slack!