Hosted LLM

Why would you self-host your model?

  1. Data Privacy and Security: You can ensure that sensitive information or proprietary data doesn't leave your organization's network. Some industries, such as healthcare or finance, have strict regulatory requirements governing data handling and processing. Self-hosting may provide more control over compliance with these regulations.

  2. Low Latency: If you need low-latency access to the model, hosting it on your own servers can minimize network latency, providing faster responses. Further, you can deploy the model in data centers that are strategically located to minimize latency for those users.

  3. Customization: You can fine-tune the model on your specific data and train it to perform tasks that are tailored to your business. You can even run the model offline with no internet connection.

  4. Cost Control: Depending on the volume of requests and usage, self-hosting might be more cost-effective in the long run compared to using cloud-based API services, which charge per request or usage.

  5. Scalability: You can add more resources as your workload increases, which also means you no longer need to worry about OpenAI rate limits. Relying on third-party APIs or cloud services means depending on external providers; self-hosting reduces this dependency and minimizes the risk of service disruptions caused by external factors.

It's important to note that self-hosting also comes with challenges and responsibilities, such as managing server infrastructure, ensuring model performance and reliability, and handling updates and maintenance. These tasks can be time-consuming and require specialized knowledge and skills, which is exactly why you can offload all this work to Mano, which provisions monitoring infrastructure to scale your fleet based on your usage.

How much would it cost to self-host on a GPU?

Server type is the primary cost factor for hosting your own LLM on AWS; the cost of the server depends on the model chosen and its memory requirements. The following table shows the cost of hosting a single LLM on AWS EC2 instances with different configurations. We used a quantized version of Llama 2 to reduce memory requirements and cost (a rough cost-per-token calculation follows the table):

| Model | Instance | Latency (ms/token) | Throughput (tokens/s) | Cost ($/mo) |
| --- | --- | --- | --- | --- |
| Llama 2 7B | g5.2xlarge | 34.24 | 120 | $885 |
| Llama 2 13B | g5.2xlarge | 56.23 | 71 | $885 |
| Llama 2 70B | ml.g5.12xlarge | 138.34 | 33 | $4,140 |
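
To put these numbers in perspective, you can convert the monthly instance cost into a rough cost per million generated tokens. The sketch below assumes the optimistic case of running at the quoted throughput around the clock; the helper function is illustrative, not part of any library.

```python
# Back-of-the-envelope cost per million generated tokens, assuming the instance
# runs at the quoted throughput 24/7 for a 30-day month (real utilization will
# be lower, so the real cost per token will be higher).
def cost_per_million_tokens(monthly_cost_usd: float, tokens_per_second: float) -> float:
    tokens_per_month = tokens_per_second * 60 * 60 * 24 * 30
    return monthly_cost_usd / (tokens_per_month / 1_000_000)

print(f"{cost_per_million_tokens(885, 120):.2f}")   # Llama 2 7B  on g5.2xlarge     -> ~$2.85
print(f"{cost_per_million_tokens(885, 71):.2f}")    # Llama 2 13B on g5.2xlarge     -> ~$4.81
print(f"{cost_per_million_tokens(4140, 33):.2f}")   # Llama 2 70B on ml.g5.12xlarge -> ~$48.40
```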

Can you run the models on CPU to save costs?

  1. Slower Inference: CPUs are generally slower than GPUs for deep learning tasks. Inference (generating responses from a pre-trained model) on a CPU will typically be slower, which can impact the responsiveness of applications using the model. It's roughly 4 times slower than a GPU even when the thread count is tuned and the model is quantized (see the picture below for tokens per second across different 7B models).

  2. Cost Savings: You will save money by using CPUs instead of GPUs, and with quantization you can fit the models into instances with less memory and save even more. You will have to max out CPU core utilization to achieve high token throughput and low latency. Libraries like llama.cpp and ggml simplify the process of running LLMs on a CPU (see the sketch after this list).

  3. Model Size: The size of the model matters. Smaller versions of language models may perform reasonably well on CPUs, whereas larger models may struggle due to their computational demands.

  4. Caching: Implementing a caching mechanism for frequently used inputs can reduce the load on the CPU by reusing precomputed responses.
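
Below is a minimal sketch of CPU-only inference with a quantized model via llama-cpp-python, combined with a simple in-memory cache for repeated prompts. The model path, context size, and thread count are assumptions to adjust for your hardware.

```python
from functools import lru_cache

from llama_cpp import Llama  # pip install llama-cpp-python

# Load a 4-bit quantized GGUF model (path is a placeholder) and pin inference
# to the number of physical CPU cores to maximize token throughput.
llm = Llama(
    model_path="./llama-2-7b.Q4_K_M.gguf",
    n_ctx=2048,    # context window
    n_threads=8,   # tune to your physical core count
)

@lru_cache(maxsize=256)
def generate(prompt: str) -> str:
    """Cache responses for frequently repeated prompts to spare the CPU."""
    out = llm(prompt, max_tokens=128, temperature=0.0)  # deterministic, so caching is safe
    return out["choices"][0]["text"]

print(generate("Summarize the benefits of self-hosting an LLM in one sentence."))
```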

Is fine-tuning possible with a self-hosted LLM?

Yes. Fine-tuning allows you to adapt a pre-trained LLM to specific tasks or domains, making it more specialized and accurate for particular use cases. Here is a brief explanation (a minimal fine-tuning sketch follows the list):

  1. Customize for Specific Tasks: You can fine-tune the LLM to perform tasks like sentiment analysis, named entity recognition, or any other text-based task relevant to your business or project.

  2. Domain Adaptation: If your data is domain-specific (e.g., legal, medical, financial), fine-tuning can help the model better understand and generate text in that domain.

  3. Improved Performance: Fine-tuning often leads to better performance on your specific tasks compared to using a generic pre-trained model.
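
As a sketch, here is what parameter-efficient fine-tuning (LoRA) of a self-hosted Llama 2 model could look like with the Hugging Face transformers, peft, and datasets libraries. The base model, dataset file, and hyperparameters are assumptions, not a prescribed recipe.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"  # assumed base model (gated; requires access approval)
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Attach low-rank adapters so only a small fraction of the weights is trained.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Tokenize a domain-specific text corpus (file name is a placeholder).
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
                      batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-domain-lora",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8,
                           num_train_epochs=1,
                           learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("llama2-domain-lora")  # saves only the small adapter weights
```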

Can I integrate a self-hosted LLM with my existing applications and services?

  1. API Integration: We provide an OpenAI-compatible endpoint that developers can use to extend their existing services, along with SDKs that offer one-liners to call your LLMs and perform inference (see the sketch after this list).

  2. Workflow Builder: We provide a no-code workflow builder that can be used to build workflows that call your LLMs and perform inference. A workflow can be started by a webhook (when a form is submitted for example) or a cron job (every day at midnight).
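
As an illustration, because the endpoint is OpenAI-compatible, the standard openai Python client can be pointed at it by overriding the base URL. The URL, key, and model name below are placeholders for whatever your deployment exposes.

```python
from openai import OpenAI  # pip install openai

# Point the official client at the self-hosted, OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://llm.example.internal/v1",  # placeholder endpoint
    api_key="YOUR_DEPLOYMENT_KEY",               # placeholder credential
)

response = client.chat.completions.create(
    model="llama-2-7b-chat",  # placeholder model name in your deployment
    messages=[{"role": "user", "content": "Classify this support ticket: 'My invoice is wrong.'"}],
)
print(response.choices[0].message.content)
```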

Current Offerings

There are many players that provide LLM hosting services. Ultimately the choice of service depends on your use case; there is no one-size-fits-all solution. If you have a heavier workload, you are probably better off self-hosting for cost reasons. If you have a lighter workload and want to get started quickly, a serverless solution is probably the better fit.

| Name | Description | Self-Host | Models | Serverless |
| --- | --- | --- | --- | --- |
| Anyscale | Anyscale Endpoints is a fast and scalable API to integrate OSS LLMs into your app. | Yes | Llama | Yes |
| AWS Bedrock | The easiest way to build and scale generative AI applications with foundation models | Yes | Llama, Anthropic, AI21 and Amazon Titan | Yes |
| AWS Sagemaker | Build, train, and deploy machine learning (ML) models on AWS | Yes | Everything | Yes |
| Banana | Serverless GPUs for AI | No | Everything | Yes |
| Beam.cloud | Train and deploy AI and LLM applications securely on serverless GPUs | No | Everything | Yes |
| Cerebrium | Serverless infrastructure for ML: train, deploy and monitor | No | Everything | Yes |
| Haven | Train and Deploy Open Source AI | No | Everything | No |
| Hugging Face | Platform to host models and apps | Yes | Everything | No |
| Inferless | Serverless GPUs to scale your machine learning inference without any hassle | No | Everything | Yes |
| Modal | Cloud functions re-imagined | No | Everything | Yes |
| Mystic | Serverless GPU inference for ML models | No | Everything | Yes |
| Replicate | Run models in the cloud at scale | No | Everything | No |
| Scale AI | Run inference and fine-tuning on Scale's infrastructure | Yes | Llama, Falcon, MPT | No |
| PyQ | Build and Deploy Customized Task Specific AI with Ease | No | Everything | No |
| Perplexity | Generates a model's response for the given chat conversation. | No | Everything | No |