In this blog

My technology expertise has always been centered on data center technologies, with a focus on areas such as compute, storage, virtualization, converged, hyper-converged and NFV. In the past couple of years, I have gotten more involved with container technologies, mainly through Docker, VMware Tanzu, and most recently, Kubernetes. When ChatGPT launched in late 2022, I admittedly didn't think much of it until I got access to the OpenAI beta and experienced it.

Using ChatGPT firsthand was truly an eye-opening experience for me. Working in the Advanced Technology Center (ATC) here at WWT, we are very fortunate to be at the forefront of all types of emerging trends in technology, and AI has been no different. For many years, the ATC has been investing in areas that enable what we are seeing in AI today. Things like the latest GPUs, DPUs, FPGAs, high-speed networking and storage have been integrated into the ATC for several years. Our Data Science team has been in place for almost 10 years. After my first interaction with ChatGPT, it felt like this was going to be different in a much bigger way than anything in the recent past, so I immediately wondered how I could learn more about it.

How can you build your own ChatGPT?

At the most basic level, I knew that ChatGPT, Bard, etc., required large amounts of compute power, GPUs/DPUs, storage and high-speed networking, along with some sort of dataset. Could I just take consumer-grade hardware, an Apple MacBook Pro M1 and an Intel NUC with an Nvidia RTX 4070 GPU, and build a local equivalent of ChatGPT? I did some quick searching and saw that you could.

People learn differently. If you want to build a plane, most people would spend time reading everything they can about how to do it and then build it, while others would rather build the plane while flying it. I happen to fall into the latter category, so I set out to build my own local ChatGPT and figure out what everything was after I got it working. I googled "run chatgpt locally offline," and one of the first results was a Reddit post covering that exact topic – perfect, I was off to a good start. It seemed like there were a few top contenders on GitHub that covered what I wanted to do, so I started looking at how a few of them worked. While digging into those projects, the concept of an LLM kept coming up. I did not know what that was, so I stopped to spend some time figuring out what LLMs are and why they are important.

LLMs, transformers, GPTs

What is an LLM? I figured I would ask ChatGPT4. This is what I got:

  • An LLM, which stands for Large Language Model, is a type of artificial intelligence algorithm designed to understand, generate, and interact with human language. These models are trained on vast amounts of text data, enabling them to comprehend context, answer questions, write content, translate languages, and perform a variety of other language-based tasks.
  • One of the most well-known types of LLMs is the Transformer-based models, such as GPT (Generative Pre-trained Transformer) developed by OpenAI. These models use deep learning techniques and have been trained on diverse internet text. They predict the likelihood of a sequence of words appearing together, which allows them to generate coherent and contextually relevant text based on the input they receive.
  • LLMs are used in a wide range of applications, including chatbots, writing assistants, search engines, and more, to provide users with relevant, conversational, and sometimes creative content.

A key thing I took from that was that LLMs are "trained on vast amounts of text data." My interpretation of that definition was that an LLM is basically built from large quantities of text data, curated in a very specific way for a specific set of tasks.

What did that mean, where did the data come from and who or what trained it?  Again, I asked ChatGPT4, which said that training an LLM, like the one used in GPT, involves several key steps, typically executed on powerful computing hardware due to the vast amount of data and the computational complexity of the models.

Data preparation:

  • Collect data: Gather a large dataset of text from various sources. This can include books, websites, articles, and other forms of written language.
  • Clean and preprocess: Clean the data to remove irrelevant information, such as HTML tags from web pages. Preprocessing steps might also include tokenization (breaking text into pieces like words or subwords), normalization (converting text to a uniform format, e.g., lowercasing), and possibly deduplication (removing duplicate entries).
  • Tokenization: Convert the cleaned text into tokens using a tokenizer. Tokens can be words, subwords, or characters. This step translates the raw text into a format that the model can understand and process.
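To make the tokenization step concrete, here is a minimal sketch using a tokenizer from the Hugging Face transformers library; the model name is just an example, and any tokenizer hosted on Hugging Face works the same way.

```python
# Minimal tokenization sketch using the Hugging Face "transformers" library.
# The model name below is only an example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

text = "Large language models predict the next token."
tokens = tokenizer.tokenize(text)   # subword pieces the tokenizer produces
ids = tokenizer.encode(text)        # integer IDs the model actually consumes

print(tokens)
print(ids)
```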

Model architecture selection:

  • Choose an architecture: Decide on a model architecture. For LLMs, Transformer-based architectures are common due to their effectiveness in capturing the complexities of language.
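As a rough illustration of what "choosing an architecture" can look like in code, here is a sketch that instantiates a small, untrained GPT-style Transformer with the transformers library. The hyperparameters are arbitrary toy values, nothing like the models behind ChatGPT.

```python
# Sketch: instantiate a small, untrained GPT-style Transformer.
# These hyperparameters are toy values chosen for illustration only.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=32000,   # size of the tokenizer's vocabulary
    n_positions=1024,   # maximum sequence length
    n_embd=512,         # embedding / hidden size
    n_layer=8,          # number of Transformer blocks
    n_head=8,           # attention heads per block
)
model = GPT2LMHeadModel(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")
```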

Model training:

  • Initial configuration: Set initial hyperparameters such as learning rate, batch size, number of layers, and model size.
  • Pretraining: Train the model on the prepared dataset. The model learns to predict the next token in a sequence given the tokens that precede it. This is typically done using a variant of the cross-entropy loss function, which measures the difference between the model's predictions and the actual sequence of tokens (see the simplified sketch after this list).
  • Adaptive learning: Adjust learning rates and other training parameters as needed to improve performance and reduce overfitting.
  • Evaluation and iteration: Regularly evaluate the model's performance using a separate validation dataset not seen during training. Use metrics like perplexity or accuracy on specific tasks to guide adjustments in training strategy.
  • Task-specific training: If the model will be used for specific tasks (e.g., question answering, translation), it can be further fine-tuned on datasets specific to those tasks. This involves additional training runs on the new dataset, allowing the model to adapt its knowledge to perform well on the specific task.
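To make the pretraining bullet a little more concrete, here is a heavily simplified sketch of a single next-token-prediction training step in PyTorch, reusing the toy GPT-2-style configuration from the earlier sketch. Real pretraining runs this loop over enormous datasets on large GPU clusters.

```python
# Heavily simplified next-token-prediction training step (PyTorch).
# Real pretraining uses huge curated datasets, many GPUs, and far more
# careful optimization than this toy example.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

model = GPT2LMHeadModel(GPT2Config(vocab_size=32000, n_embd=512, n_layer=8, n_head=8))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# A toy "batch" of token IDs; in practice these come from the tokenized dataset.
input_ids = torch.randint(0, 32000, (2, 128))

model.train()
outputs = model(input_ids=input_ids, labels=input_ids)  # labels are shifted internally
loss = outputs.loss                                      # cross-entropy over next tokens
loss.backward()
optimizer.step()
optimizer.zero_grad()

print(f"training loss: {loss.item():.3f}")
```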

My main takeaway was that the LLM is the key part of ChatGPT's usability, and that creating an LLM from scratch requires massive amounts of compute time and data, the skill to curate that data, and then the resources to train on it to produce a usable model. It seemed like creating my own LLM, while technically possible, was not going to be realistic on my timeline and hardware, so I started looking into where I could get a model that had already been created. That is when I found out about LM Studio, Jan and Hugging Face.

LM Studio, model sizes and Hugging Face

While researching how to build and run my own ChatGPT, I knew I would need some mechanism for running an LLM on some form of compute. I had the compute in my gaming PC at home, which has a 13th Gen Intel Core i9, 64GB of RAM and an Nvidia RTX 4070 Ti GPU, so I started looking for software to run an LLM locally. While I am a Mac user in my day-to-day life and comfortable with a CLI, I specifically picked Windows in this case, since I had the gear and wanted something I could get up and running quickly. A quick search turned up LM Studio as a top result.

LM Studio:

  • Software freely available for Windows, macOS, and Linux that allows you to do the following:
    • Run LLMs on your local machine, entirely offline
    • Use models through the in-app Chat UI
    • Run an OpenAI-compatible server to interact with the model directly through its API (see the example after this list)
    • Utilize dedicated GPUs to run LLMs locally
    • Discover new and trending LLMs through the app's home page and download them
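The OpenAI-compatible server is what makes the local model scriptable. Here is a minimal sketch using the standard OpenAI Python client, assuming LM Studio's default local address of http://localhost:1234/v1; check the server tab in LM Studio for the exact address, port and model identifier on your machine.

```python
# Sketch: query a local LM Studio server with the OpenAI Python client.
# The base_url, port and model name are examples; use whatever LM Studio's
# server tab shows on your machine. The API key can be any placeholder string.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # LM Studio typically maps this to whatever model is loaded
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain GPU offload in one sentence."},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)
```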

I downloaded LM Studio and installed it on my PC. Once it was installed, I started looking at the type of model I wanted to download and try. While LM Studio has a good interface for finding models, I did not know what everything in the model names meant in terms of compatibility for running locally. Most of it came down to size, both of the LLM file itself and of the resources on my machine, so I needed a brief explanation of the terms before proceeding.

Model size and quantizing:

  • Every model I saw on LM Studio, in addition to the actual model name/type, had both a "b" and a "q" value in the name. An example name could be "Mistral-7B-Instruct-v0.2.Q8_0.GGUF."
  • Models come in sizes, referred to as "b" sizes. The b stands for the number of parameters in the model, in billions. A larger number generally means a "smarter" model, as well as a larger file size. In the example above, 7B means it is a 7-billion-parameter model.
  • Quantization, the "q" number in the name, refers to the process of reducing the precision of the numbers used in the model's computations. This is usually done to reduce the model's size and speed up its operation, especially for deployment on devices with limited computational power. In the example above, the model uses 8-bit quantization.
  • The general goal for performance is to find a model that does not exceed the GPU memory of whatever machine you are using, so that the GPU does as much of the work as possible (the quick estimate after this list shows why).
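A back-of-the-envelope way to see how the "b" and "q" values interact is to estimate a model's footprint as parameters times bits per weight divided by eight. The sketch below does exactly that; real GGUF files vary somewhat because layers can mix precisions, and the runtime needs extra memory for the context window and KV cache.

```python
# Rough memory estimate: parameters (billions) x bits per weight / 8 = bytes.
# Real GGUF files differ a bit because layers can mix precisions and the
# runtime needs additional memory for the context window and KV cache.
def approx_model_gb(params_billions: float, bits_per_weight: int) -> float:
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1024**3

for bits in (16, 8, 5, 4):
    print(f"7B model at {bits}-bit quantization: ~{approx_model_gb(7, bits):.1f} GB")

# A 7B model at 8-bit comes out to roughly 6.5 GB, which is why it fits in a
# 12 GB GPU, while the unquantized 16-bit version would not.
```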

Hugging Face:

  • Hugging Face is the GitHub equivalent for LLMs.
  • It is a large open-source community that creates, contributes, tunes and maintains hundreds of different model types and makes them publicly available.
  • It is the go-to destination for the most popular LLMs.
  • When you search within LM Studio, it is searching Hugging Face.
  • When searching for models optimized for consumer-grade GPUs and hardware, append "GGUF" to your search to find models packaged specifically for that hardware.
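For reference, the same kind of GGUF-focused search can be run directly against Hugging Face with its huggingface_hub Python package; the query string here is just an example, and LM Studio is doing something similar behind the scenes.

```python
# Sketch: search Hugging Face for GGUF-packaged models from Python.
# The search term is only an example.
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(search="mistral 7b instruct gguf", limit=10):
    print(model.id)
```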

LM Studio has a nice feature: while you are searching for models, it knows the resources available in your machine, such as system RAM and GPU VRAM, and will tell you if full GPU offload is possible. After looking through the various options, I chose a chat-optimized Llama LLM with 7 billion parameters that would run fully within the 12GB of VRAM on my GPU and let it download.

Running an LLM

Now that I had downloaded the LLM I wanted, I loaded up the local chat feature within LM Studio using the out-of-box presets. There are a multitude of parameters you can change for any model, so I am not going to cover them here, except to say that I ran my chat queries both with and without GPU offload to see the difference, and it was night and day – the GPU-accelerated runs were much faster. I was able to enter text-based queries, very much like one would in ChatGPT, and get responses based on the LLM in use.

The results were surprising: while it is heavily LLM dependent, I saw accuracy similar to the paid version of ChatGPT when doing a side-by-side comparison of some AWS CLI commands I was trying to learn for another project. While I have a PC with an Nvidia GPU, I also have an M2 MacBook Pro with 19 GPU cores, so as a test I loaded up LM Studio and the same model, and the results were similar on my MacBook.

What else can you do?

Replicating the functionality of ChatGPT locally for free was what I originally set out to accomplish, and I would consider these results a success. Throughout my research, I came across many other things that I now also want to try to replicate on my own equipment as I learn more about AI. I've listed a few of them below for further reading, but really the possibilities are endless.

  • Look more into open-source alternatives to LM Studio, like Jan.ai and others.
  • Use a local LLM, via the OpenAI-compatible server, to check and optimize code within Visual Studio Code using the Continue extension.
  • Use AutoGen Studio to automate tasks.
  • Use AnythingLLM to create local RAG functionality along with a local LLM.
  • NVIDIA has recently released a beta of NVIDIA Chat with RTX that offers a lot of the same functionality as LM Studio for running local LLMs, along with RAG functionality to incorporate your own data into an LLM.