Celebrating the Hacker Mindset: Bob van Luijt on the Revolution of Generative AI
The cofounder and CEO of Weaviate shares his journey from studying jazz to building a vector database empire valued at $200 million
In startups, it’s rare to be seven years ahead of the curve and still have perfect timing. But Bob van Luijt had a striking experience in 2016 that led him to create Weaviate. He’s a connect-the-dots founder, a grand synthesizer, which no doubt helped him build the most widely adopted open source vector database in the enterprise.
I sat down with Bob to discuss the origins of Weaviate, today’s competitive landscape for vector databases, and how the enterprise is responding to shifting perceptions and appetite for LLMs.
When did you first realize that you were really interested in technology?
When I was in high school, there were two things I was very interested in: the internet and music. I got an opportunity to start a small business making websites when I was still in high school. When it came time for me to study, I chose to study music. I was still building software to make money, but I was not interested in business at all. That was just not on my radar. Then, I got an opportunity to work with machine learning. I was struck by the magic of being able to calculate with words. We started asking whether we could take paragraphs of text and then create embeddings for the individual words. It was messy then. We started exploring application programming interfaces (APIs) to integrate those things. That’s how everything started.
All kinds of things were coming together. Back then, I was part of a group that called itself a community, Google Developer Experts. As part of the community, I was invited to Google I/O. I was sitting in the audience in 2016 when Sundar Pichai said Google would change from mobile first to AI first. He said that back in 2016. I had this moment of epiphany where I knew what they were doing: they were creating vector embeddings for webpages.
That language can be represented mathematically is very important for machine learning, because it allows us to abstract away from words that mean things to numbers that we can do computations and predictions with. Embeddings, and vector embeddings specifically, are the most efficient way to represent these words as numbers. What was your intuition about why that works?
I like the word you used: intuition. I mentioned that I studied music, first jazz and contemporary music after that. Take a jazz tune, for example, a John Coltrane song, and write down the chords he improvises over. Being able to read those chords does not mean you can play it right. He wrote pretty complex jazz songs. It’s the same with the mathematical notation of vector embeddings.
In my mind, when I was studying music or when I was working with these models, it was the same mechanism. With a jazz song, you practice, practice, and practice, and all of a sudden you go, “Oh, I get it.” Then you start to know how you can use those embeddings, how they correlate to each other, and how you can fix them. For me, it’s a visualization. It doesn’t start from a mathematical perspective. It starts with playing around. If I work with technology, software, or music, I have that intuition. It’s hard for me to pinpoint what it is. The point I’m making is that we need to celebrate the hacker mindset: playing with something to investigate it. You don’t need a PhD to work with these machine-learning models and build something new.
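To make “calculating with words” concrete, here is a minimal sketch of the geometry behind vector embeddings. The three-dimensional vectors are made-up toy values, not the output of any real model; production embeddings have hundreds or thousands of dimensions, but the similarity computation works the same way:

```python
import numpy as np

# Toy 3-dimensional "embeddings" -- invented values for illustration only.
embeddings = {
    "jazz":     np.array([0.9, 0.1, 0.2]),
    "music":    np.array([0.8, 0.2, 0.3]),
    "database": np.array([0.1, 0.9, 0.7]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "jazz" lands closer to "music" than to "database" -- meaning as geometry.
print(cosine_similarity(embeddings["jazz"], embeddings["music"]))
print(cosine_similarity(embeddings["jazz"], embeddings["database"]))
```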
Okay. It’s 2016, you’re at this conference, and you hear that Google will become an AI company. You then quite quickly start the Weaviate project. What was it and what has it become?
One of the things that I was interested in back then was the Semantic Web. I was with this group of Semantic Web members asking why people weren’t filling in the microdata. I slowly started to ask, “What if we let the machine-learning model solve this for us?” The first iteration of the API we had was based on Schema.org.
Two important things started to happen. First, people started to ask if they could do this based on their own data. So we let go of any restrictions in the schema so they could use it with any data they had. The second thing was figuring out how to add this to a database. Kudos to my cofounder Etienne Dilocker, who thought that we could make vector embedding a first-class citizen and create a dedicated database for it. He started to work on it from the bottom up and I began from the top down. By January 2021, Weaviate was a standalone database, where we had the vector embedding as the first-class citizen.
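As an aside, here is a deliberately naive sketch of the contract a vector database fulfills once embeddings are a first-class citizen: every object is stored together with its vector, and queries are answered by vector similarity. This is illustrative only, not Weaviate’s implementation; Weaviate answers such queries with an approximate-nearest-neighbor (HNSW) index rather than the brute-force scan shown here:

```python
import numpy as np

class ToyVectorStore:
    """Naive in-memory vector store: objects stored alongside embeddings."""

    def __init__(self):
        self.objects: list[dict] = []
        self.vectors: list[np.ndarray] = []

    def add(self, obj: dict, vector: np.ndarray) -> None:
        # The vector is a first-class field stored with the object,
        # not an afterthought bolted onto a row.
        self.objects.append(obj)
        self.vectors.append(vector / np.linalg.norm(vector))

    def near_vector(self, query: np.ndarray, limit: int = 3) -> list[dict]:
        # Brute-force cosine search; real engines use ANN indexes like HNSW.
        q = query / np.linalg.norm(query)
        scores = np.array([v @ q for v in self.vectors])
        best = np.argsort(-scores)[:limit]
        return [self.objects[i] for i in best]
```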
Let’s talk for a moment about the company. Where is everyone located and what are you focusing on these days?
We have been 100% remote from day one. We never had an office. That’s still the case. We have one person in Australia and for the rest, it’s half and half in the United States and in Europe. If you’re a vector embedding research scientist, I don’t really care whether you live in Hawaii, Tokyo, or Madrid. We have a cell structure in the company, so we don’t have a traditional hierarchy. We have cells where people work, each with a head. That’s how we’re structured and it’s allowed us to keep the organization as flat as possible for as long as possible. When I first started the company, I was told that people are the most important. Now, I understand.
My focus is on product, go-to-market, and marketing. That’s what I fill my days with. Listening to the community, listening to customers, and thinking about how the product evolves and where the market is going.
What are people paying you for right now?
Our go-to-market is twofold: we have SaaS and we have Bring Your Own Cloud (BYOC). SaaS is SaaS as you know it: you sign up, swipe a credit card, and you can use the database. Our bigger customers actually use BYOC, which means Weaviate runs inside their virtual private cloud on AWS, Azure, or Google Cloud. There, we charge for memory and CPU usage. It’s important to bear in mind that an open-source-based model is actually very simple: basically, if you have production workloads, you pay us. We are really good at operationalizing our own database. That’s our proprietary software stack: the operation of the cloud and the hybrid cloud.
What is your vision for Weaviate five years from now and what do you really want to build here?
People are spending part of their lives at this company. They join early and they’re part owners of the business. I want people to have fun, and I really mean that. There are sleepless nights here and there, but in the end, I want people to look back at their time at Weaviate and say that they met great people and had fun. From a company-building perspective, that is the most important thing to me. I was so proud this week when somebody joined and wanted to have a conversation with me. We had a chat and she said that, coming from the outside, she knew that Weaviate prides itself on having people who are kind and nice. After joining that week, she said she could feel that. I think customers see that too.
I want to win this marathon and I want to do the business building. This is also a creator thing. I want to do it well. I don’t want to build a $7 billion company; I want to build a $70 billion company as a database company. I want to figure out how we can do that, the people we need, and what we need to do in-product. One of the things I didn’t understand in the early days, for example, when I looked at companies like MongoDB or Elasticsearch, was why they didn’t just focus on search. I didn’t get that. Then I learned that the way you position the product is how you package it, and if you want to sell CPU and memory, you have to package it a certain way. That worked very well for them. I knew that if we wanted to build an effective database, that wouldn’t work for us. We were a new category.
Then we learned that these large generative models are stateless. They’re trained somewhere, there’s a cutoff point, and that’s the state you get. So you can give the model state through retrieval augmented generation (RAG), by connecting a database to the model. And that was our thing. What observability and logging are for Elastic, generative AI will be for us. That’s the role we play in generative AI. That is a form of business building that I like: that moment of epiphany where you figure out the position of the product and the mission of the company, then work towards it. It’s like running a marathon, doing it well, and winning it.
I want to focus on that last point. The role that Weaviate can play in relation to a large language model (LLM) is to enable the LLM to keep evolving.
In software, we have two types of applications: stateful and stateless. For example, let’s say I have an Excel file and I fill in data. When I open that file again, I can see the data that I filled in. Excel is therefore stateful, because it keeps the state of what I’m storing. An MP3 file is stateless. If I listen to it, it doesn’t change. If I copy it and send it to you, you can also listen to it. It’s stateless. It turns out that stateful and stateless applications, solutions, and businesses have very different characteristics in how they operate.
A machine learning model is stateless like an MP3 file. That means the go-to-market motion we see with these models looks very much like what we saw with MP3 files. OpenAI is like a big Spotify of machine learning. You need to deal with this statelessness by gating access to the model. Otherwise, the problem is that if somebody copies it, the copy is worth just as much, but people are not going to pay you twice. So you need to find a different way to monetize it. This is a common problem with stateless technology. When you have a model that is stateless, the model has language understanding and knowledge, but that knowledge ages: language evolves, slang appears, and the model stays frozen at its cutoff. And because it has no state, if you ask who won yesterday’s match between so and so, it can’t tell you. So it retrieves that information from somewhere else, and that’s what is called retrieval augmented generation.
We can make models that know they need to ask the database for something. What comes out of the model are vector embeddings, and what the database sends back to the model to generate something are also vector embeddings. They interact with each other. The business opportunity comes from realizing that 98% of data is behind closed doors. If you want models to work with proprietary data, you use the database in combination with the model. That’s how this works.
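Here is a minimal sketch of the retrieval augmented generation loop described above. The names `embed`, `vector_db.near_vector`, and `llm.generate` are hypothetical placeholders for whichever embedding model, vector database client, and LLM you actually pair together:

```python
def answer_with_rag(question: str, vector_db, embed, llm) -> str:
    """Give a stateless model state by retrieving facts at query time."""
    # 1. Turn the question into a vector embedding (placeholder function).
    query_vector = embed(question)

    # 2. Ask the stateful database for the most similar documents.
    documents = vector_db.near_vector(query_vector, limit=3)

    # 3. Let the stateless model generate an answer grounded in them.
    context = "\n".join(doc["text"] for doc in documents)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm.generate(prompt)
```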
I have the sense that over the past months, some of the initial hype around generative AI has given way to a pause at the enterprise level, as companies figure out their strategy. It seems like there is more of a healthy inquiry into where this is all leading. Are you seeing this same trend as well?
I always say there is a pre-ChatGPT and a post-ChatGPT era. For whatever reason, it was the perfect storm when OpenAI released ChatGPT in November 2022. Because of the quality of those models, a paradigm shift is happening. We’re really starting to interact with these systems in human language.
In the post-ChatGPT era, organizations that build digital products realized that ChatGPT would eat their lunch if they didn’t speed things up. ChatGPT was an eye-opener for non-technical senior people in organizations, showing them that something is happening. Now, we go to those people and talk about their generative AI strategy. We can say, “If you have one, we have the core infrastructure to power it in your organization.” Now they want to listen.
One of the problems that we have with AI is also one of the beautiful things. People start to anthropomorphize AI because they can talk to it. It’s like drawing a face on a rock and saying, “Hey, the rock smiles, the rock is happy.” That’s what is happening right now with AI. That’s sometimes nice because it generates interest. But the problem is that people start to see things that way. They forget that it’s just a statistical model predicting word after word.