
Internship! - 28/07/2024

My first internship!

Moving Forwards

Following the successful projects I was part of during my role as an SDQS at Cohere, my colleagues and I were offered internships at Cohere on the synthetic data team under an amazing manager who saw the potential in our work. This moved us out of our contractor roles and gave us many new affordances and responsibilities. As of writing, I am still in this internship position and will be wrapping up sometime in December 2024. Overall, the role of SDQS was incredibly formative for my career and I am constantly grateful for the lessons learned and the opportunities given to me.

A Whole New World

For context, I’d recommend reading my other posts about my time at Cohere if you want the full picture.

Starting this internship was abrupt: one day I was a contractor, the next I was meeting dozens of brand new people and attending a mountain of onboarding meetings while continuing the work I had already been doing on a web app. While my first steps into this internship, and how I got here, are anything but the norm, I will say that the process was smoother than anticipated. Cohere does a lot for its interns; in my opinion, the entire experience so far has been phenomenal. It feels easy to talk to people and get information about how to collaborate and align on tasks. The ‘social hierarchy’ one would expect to come with working at a company focused on machine learning, a notoriously complex and knowledge-gated subject, is difficult to find when interacting with others.

Whatever it takes.

What I Do

My internship has me working on the Synthetic Data team, where we focus on a range of things. First, let’s address a big question:

What is synthetic data?  

Synthetic data is what we call data points generated by large language models, or any other automated means, rather than written by humans. We need synthetic data because the volume of usable human-written data for training large language models is diminishing rapidly. Synthetic data generation is a careful balancing act between quality and volume. On one hand, using a model to generate data is much faster and cheaper than asking humans to hand-write it; on the other hand, models are known to hallucinate, repeat themselves, and even produce undesirable patterns in their outputs (a common one from ChatGPT is overusing the word ‘delve’). Generating data that is out of distribution, high quality, and targeted is the goal.
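To make that balancing act a little more concrete, here is a minimal sketch of what a generate-then-filter loop can look like. The generate_completion function and the filter heuristics are hypothetical placeholders for whatever model client and quality checks a real pipeline would use; this is not Cohere’s actual setup.

```python
import re

# Hypothetical placeholder for a real model client.
def generate_completion(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

# Crude quality heuristics: patterns we never want to keep.
BANNED_PATTERNS = [r"\bdelve\b", r"as an ai language model"]

def is_acceptable(sample: str, seen: set) -> bool:
    """Reject duplicated, degenerate, or pattern-ridden generations."""
    if sample in seen:  # exact repeats add volume, not signal
        return False
    if any(re.search(p, sample, re.IGNORECASE) for p in BANNED_PATTERNS):
        return False
    return len(sample.split()) >= 20  # drop very short, degenerate outputs

def generate_dataset(prompts, per_prompt: int = 4):
    """Generate several completions per prompt, keeping only those that pass the filters."""
    dataset, seen = [], set()
    for prompt in prompts:
        for _ in range(per_prompt):
            sample = generate_completion(prompt)
            if is_acceptable(sample, seen):
                seen.add(sample)
                dataset.append({"prompt": prompt, "completion": sample})
    return dataset
```

The cheap part is the generation loop; the hard part is making the filters strict enough to protect quality without throwing away the volume you generated in the first place.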

Alongside this, I have also worked on some model evaluation. We largely do not understand how large language models work on the inside; we can’t really just open them up, and for this reason LLMs are often referred to as black boxes. One thing we can do is benchmark them to get an idea of how they perform. To do this we need three things: a model to test, a set of questions, and a set of answers to those questions. We can then ask the model all of these questions and evaluate how its responses stack up against the reference answers. This becomes tricky, though, when the answer isn’t easy to verify; for example, ‘Give me the answer to 34+7’ is a lot easier to check for correctness than ‘Write me a 500 word email to my professor about why I should be given an A+’. There is also the approach of human evaluation, where humans pick the best response out of at least two from different models in a blind test. This is how leaderboards like LMSys work. While there is no ‘perfect’ approach to figuring out if one model is better than another, we have relatively okay proxies to rely on for now.
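For the easy-to-verify case, the core of a benchmark can be as simple as the exact-match loop sketched below. The ask_model function is a hypothetical placeholder for whatever model client is being tested; real harnesses add much more answer normalisation and, for open-ended tasks, human or model-based judging.

```python
# A minimal exact-match benchmark, assuming a hypothetical ask_model() client.
def ask_model(question: str) -> str:
    raise NotImplementedError("plug in the model under test here")

def exact_match_accuracy(questions, answers) -> float:
    """Fraction of questions where the model's response matches the reference answer."""
    correct = 0
    for question, reference in zip(questions, answers):
        response = ask_model(question)
        # Light normalisation is about as far as exact match can go;
        # open-ended answers (essays, emails) need human or model-based judges.
        if response.strip().lower() == reference.strip().lower():
            correct += 1
    return correct / len(questions)

# Easy to verify: exact_match_accuracy(["What is 34+7?"], ["41"]) is 1.0 only if
# the model answers "41"; a 500 word email has no single reference answer to match.
```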

When a measure becomes a target, it ceases to be a good measure.