A quick deep dive into recent AI tools

GPT-3, DALL-E, Stable Diffusion, and more… what are these?

You’re going to see the three letters LLM a lot over the next six months, kickstarted by OpenAI’s release of GPT-3 and DALL-E – more specifically, the release of their newest “davinci” model within GPT-3, and DALL-E 2. Both have shown a huge jump ahead of their predecessors. In addition, we’re seeing a wide range of alternatives popping up, which I’ll get into below.

With such powerful tools now available to the public, there is a wealth of new opportunities to tackle – and many risks and concerns, both for individual businesses and for society at large.

Quick Intro

Before we dive in, what is an LLM? It stands for Large Language Model, a subset of Language Models (LM), which are used in Natural Language Processing (NLP).

In short, GPT-3 is a generalized AI trained on enough data that it can do a wide range of tasks – everything from answering historical questions and following complex instructions to writing code or poetry. All of this, out of the box. DALL-E is a version of GPT-3 trained specifically to generate images from text.

Text-to-Image Generators

My experience in this started with Midjourney, which you can access on Discord here. Compared to DALL-E, it’s a bit more stylized. Google also has two models, Imagen and Parti – a diffusion-based and an autoregressive model, respectively. Google is more closed with its technology and does not make it available to others, but is constantly sharing improvements and tools such as Dreambooth (AI output based on a reference image) and this Parti/Imagen hybrid.

I ended up spending more time with DALL-E, likely because the web-based UI was more intuitive to me. I’ve done a couple of projects: PixelBeast “glow ups” and Animal Buildings. I often do DALL-E sessions with my kids. It’s a great place to start learning how to write and adjust prompts, since the response is visual – making it easy to see the impact of adjusting a prompt. I wrote about this new skillset, which I later learned is called “prompt engineering”. To up your DALL-E skills, read this 82-page DALL-E prompt guide.

The recent hotshot is Stable Diffusion by Stability.ai, which is open source and can fit into 10 GB of GPU memory. You can try it at DreamStudio, their web-based tool (API coming soon). It has more settings than DALL-E. And because it’s open source, people are building plug-ins for Photoshop (example) and Figma (example), and extensions like this face correction tool. There’s also img2img, which allows you to provide a reference image. People have, for example, used this to add realism to their kids’ drawings (try it yourself here – it will be in DreamStudio soon). According to Emad, Stability.ai is looking at building for audio, 3D, and video.
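To give a sense of how little code this takes, here’s a minimal img2img sketch using Hugging Face’s open-source diffusers library. The model ID, prompt, file names, and strength value are my own illustrative choices (and depending on your diffusers version, the reference-image argument may be called init_image instead of image):

```python
# Minimal img2img sketch with Stable Diffusion via the diffusers library.
# Assumes a CUDA GPU and that you've accepted the model license on Hugging Face.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Load the open-source Stable Diffusion weights in half precision,
# which is how it fits into roughly 10 GB of VRAM.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",  # illustrative model ID
    torch_dtype=torch.float16,
).to("cuda")

# The reference image, e.g. a kid's drawing. "strength" controls how far the
# model may drift from it (0 = keep it as-is, 1 = ignore it entirely).
init_image = Image.open("kids_drawing.png").convert("RGB").resize((512, 512))
result = pipe(
    prompt="a photorealistic dragon flying over a castle",
    image=init_image,
    strength=0.75,
    guidance_scale=7.5,
).images[0]
result.save("dragon_photoreal.png")
```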

This is just the tip of what I’ve seen. I’m hearing of professional designers using these tools for inspiration, or as a base layer for their art. One gaming use case might be layering this in as the rendering engine (here is a GAN layered on top of Minecraft to make it photorealistic). Some tools/startups integrating Stable Diffusion include Artbreeder (a drawing app with Stable Diffusion) and Accomplice (AI-generated stock photos); less directly, there’s also PromptBase, a marketplace for AI prompts.

An extreme use case of this is Curt Skelton – a fake influencer created by combining two real people in DALL-E, using those generated images to create a 3D face and avatar, and running this through DeepMotion to animate the character. This turned out to be a stunt – Curt is a real person. But the ability for anyone to do this is clearly not far away. (Just look at ObEN, Soul Machines, Didimo, Inworld, Synthesia…)

GPT-3 and more

I thought my mind was blown by DALL-E until I started playing with GPT-3. I covered some GPT-3-powered startups back in May last year (Copy.ai has since surpassed 2M users). Since then, a new model called davinci has launched, which is decidedly more powerful than anything that’s been available on the market.

It can do everything from answering questions, acting as a chatbot, and summarizing text, to explaining and writing code (see Codex, which powers GitHub’s Copilot). The reason the davinci model is so powerful is that it was trained with 175B parameters (compared to 6.7B parameters for Curie – its second most powerful model, which people were playing with when GPT-3 first launched).

A good place to start is the GPT-3 “Playground”, where the UI is a simple prompt box. Simply type something in and click “Generate”. Once you’ve played with a few examples, you can tweak the settings to see their impact on the results. Where it gets really powerful is when you start fine-tuning the model – uploading a large set of prompt/response pairs to create a custom model specialized for your use case. To build an AI-powered app, you simply call your model with a prompt via the API – which is how I am building Mini Yohei.
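As a rough sketch of what that API call looks like – using OpenAI’s Python library as it existed at the time (pre-1.0), with an illustrative prompt and settings; a fine-tuned model would be referenced by the name the fine-tune job returns:

```python
# Minimal GPT-3 completion call via OpenAI's (pre-1.0) Python library.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder – keep real keys out of source code

# The same knobs as the Playground: temperature controls randomness,
# max_tokens caps the length of the completion.
response = openai.Completion.create(
    model="text-davinci-002",  # or the name of your fine-tuned model
    prompt="Explain what a large language model is in one sentence.",
    max_tokens=64,
    temperature=0.7,
)
print(response.choices[0].text.strip())
```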

It’s worth noting here that Google has scaled its more secretive PaLM model to 540B parameters, and the results supposedly show clear improvement. MT-NLG is the new one by Microsoft/Nvidia, trained with 530B parameters. There’s also OPT-175B by Meta (175B params, which the open-source BlenderBot 3 chatbot is based on), Bloom by BigScience (open source, multiple languages, 176B params), Chinchilla by DeepMind (70B params), GPT-NeoX-20B by EleutherAI (20B params), GLM-130B by Tsinghua University (bilingual, 130B params), PanGu-α by Huawei (200B params), Jurassic-1 by AI21 (178B params), and Cohere.ai (backed by Tiger Global and Index), amongst many others.

What does this mean for founders/VCs?

Given the power of this new technology, I believe it is important for VCs and founders to at least understand its capabilities, which will undoubtedly continue evolving. It will be a distraction to some – for example, Obviously AI is likely a better tool for many AI tasks. In the short term, there is ample opportunity to make money by packaging up LLMs’ capabilities nicely. In the long run, I’m excited about tracking founders and innovators who are building strong teams and relationships in the space, participating in the discussions on how to push these tools to the limit, and thinking about what can be built – not just today, but in 5 to 10 years.

