Hosted LLM

Why would you self-host your model?

  1. Data Privacy and Security: You can ensure that sensitive information or proprietary data doesn't leave your organization's network. Some industries, such as healthcare or finance, have strict regulatory requirements governing data handling and processing. Self-hosting may provide more control over compliance with these regulations.

  2. Low Latency: If you need low-latency access to the model, hosting it on your own servers can minimize network latency, providing faster responses. Further, you can deploy the model in data centers that are strategically located to minimize latency for those users.

  3. Customization: You can fine-tune the model on your specific data and train it to perform tasks that are tailored to your business. You can even run the model offline with no internet connection.

  4. Cost Control: Depending on the volume of requests and usage, self-hosting might be more cost-effective in the long run compared to using cloud-based API services, which charge per request or usage.

  5. Scalability: You can add more resources as your workload increases, which also means you no longer need to worry about OpenAI rate limits. Relying on third-party APIs or cloud services means depending on external providers; self-hosting reduces this dependency and minimizes the risk of service disruptions caused by external factors.

It's important to note that self-hosting also comes with challenges and responsibilities, such as managing server infrastructure, ensuring model performance and reliability, and handling updates and maintenance. These tasks can be time-consuming and require specialized knowledge and skills, which is exactly why you can offload all this work to Mano, which provisions monitoring infrastructure to scale your fleet based on your usage.

How much would it cost to self-host on a GPU?

Server type is the primary cost factor for hosting your own LLM on AWS; the cost of the server depends on the model chosen and its memory requirements. The following table shows the cost of hosting a single LLM on AWS EC2 instances with different configurations. We used a quantized version of Llama 2 to reduce memory requirements and cost (a rough cost-per-token calculation follows the table):

| Model | Instance | Latency (ms/token) | Throughput (tokens/s) | Cost ($/mo) |
| --- | --- | --- | --- | --- |
| Llama 2 7B | g5.2xlarge | 34.24 | 120 | $885 |
| Llama 2 13B | g5.2xlarge | 56.23 | 71 | $885 |
| Llama 2 70B | ml.g5.12xlarge | 138.34 | 33 | $4,140 |
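
To put these numbers in perspective, you can convert the monthly instance cost into a rough cost per million generated tokens. The sketch below assumes the optimistic case of running at the quoted throughput around the clock; the helper function is illustrative, not part of any library.

```python
# Back-of-the-envelope cost per million generated tokens, assuming the instance
# runs at the quoted throughput 24/7 for a 30-day month (real utilization will
# be lower, so the real cost per token will be higher).
def cost_per_million_tokens(monthly_cost_usd: float, tokens_per_second: float) -> float:
    tokens_per_month = tokens_per_second * 60 * 60 * 24 * 30
    return monthly_cost_usd / (tokens_per_month / 1_000_000)

print(f"{cost_per_million_tokens(885, 120):.2f}")   # Llama 2 7B  on g5.2xlarge     -> ~$2.85
print(f"{cost_per_million_tokens(885, 71):.2f}")    # Llama 2 13B on g5.2xlarge     -> ~$4.81
print(f"{cost_per_million_tokens(4140, 33):.2f}")   # Llama 2 70B on ml.g5.12xlarge -> ~$48.40
```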

Can you run the models on CPU to save costs?

  1. Slower Inference: CPUs are generally slower than GPUs for deep learning tasks. Inference (generating responses from a pre-trained model) on a CPU will typically be slower, which can impact the responsiveness of applications using the model. It's roughly 4 times slower than a GPU even when the thread count is tuned and the model is quantized (see the picture below for tokens per second across different 7B models).

  2. Cost Savings: You will save money by using CPUs instead of GPUs, and with quantization you can fit the models into instances with less memory and save even more. You will have to max out CPU core utilization to achieve high token throughput and low latency. Libraries like llama.cpp and ggml simplify the process of running LLMs on a CPU (see the sketch after this list).

  3. Model Size: The size of the model matters. Smaller versions of language models may perform reasonably well on CPUs, whereas larger models may struggle due to their computational demands.

  4. Caching: Implementing a caching mechanism for frequently used inputs can reduce the load on the CPU by reusing precomputed responses.
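
Below is a minimal sketch of CPU-only inference with a quantized model via llama-cpp-python, combined with a simple in-memory cache for repeated prompts. The model path, context size, and thread count are assumptions to adjust for your hardware.

```python
from functools import lru_cache

from llama_cpp import Llama  # pip install llama-cpp-python

# Load a 4-bit quantized GGUF model (path is a placeholder) and pin inference
# to the number of physical CPU cores to maximize token throughput.
llm = Llama(
    model_path="./llama-2-7b.Q4_K_M.gguf",
    n_ctx=2048,    # context window
    n_threads=8,   # tune to your physical core count
)

@lru_cache(maxsize=256)
def generate(prompt: str) -> str:
    """Cache responses for frequently repeated prompts to spare the CPU."""
    out = llm(prompt, max_tokens=128, temperature=0.0)  # deterministic, so caching is safe
    return out["choices"][0]["text"]

print(generate("Summarize the benefits of self-hosting an LLM in one sentence."))
```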

Is fine-tuning possible with a self-hosted LLM?

Yes. Fine-tuning allows you to adapt a pre-trained LLM to specific tasks or domains, making it more specialized and accurate for particular use cases. Here is a brief explanation (a minimal fine-tuning sketch follows the list):

  1. Customize for Specific Tasks: You can fine-tune the LLM to perform tasks like sentiment analysis, named entity recognition, or any other text-based task relevant to your business or project.

  2. Domain Adaptation: If your data is domain-specific (e.g., legal, medical, financial), fine-tuning can help the model better understand and generate text in that domain.

  3. Improved Performance: Fine-tuning often leads to better performance on your specific tasks compared to using a generic pre-trained model.
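
As a sketch, here is what parameter-efficient fine-tuning (LoRA) of a self-hosted Llama 2 model could look like with the Hugging Face transformers, peft, and datasets libraries. The base model, dataset file, and hyperparameters are assumptions, not a prescribed recipe.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"  # assumed base model (gated; requires access approval)
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Attach low-rank adapters so only a small fraction of the weights is trained.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Tokenize a domain-specific text corpus (file name is a placeholder).
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
                      batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-domain-lora",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8,
                           num_train_epochs=1,
                           learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("llama2-domain-lora")  # saves only the small adapter weights
```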

Can I integrate a self-hosted LLM with my existing applications and services?

  1. API Integration: We provide an OpenAI-compatible endpoint that developers can use to extend their existing services, along with SDKs that offer one-liners to call your LLMs and perform inference (see the sketch after this list).

  2. Workflow Builder: We provide a no-code workflow builder that can be used to build workflows that call your LLMs and perform inference. A workflow can be started by a webhook (when a form is submitted for example) or a cron job (every day at midnight).
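
As an illustration, because the endpoint is OpenAI-compatible, the standard openai Python client can be pointed at it by overriding the base URL. The URL, key, and model name below are placeholders for whatever your deployment exposes.

```python
from openai import OpenAI  # pip install openai

# Point the official client at the self-hosted, OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://llm.example.internal/v1",  # placeholder endpoint
    api_key="YOUR_DEPLOYMENT_KEY",               # placeholder credential
)

response = client.chat.completions.create(
    model="llama-2-7b-chat",  # placeholder model name in your deployment
    messages=[{"role": "user", "content": "Classify this support ticket: 'My invoice is wrong.'"}],
)
print(response.choices[0].message.content)
```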

Current Offerings

There are many players that provide LLM hosting services. Ultimately the choice of service depends on your use case; there is no one-size-fits-all solution. If you have a heavier workload, you are probably better off self-hosting for cost reasons. If you have a lighter workload and want to get started quickly, a serverless solution is probably the better fit.

| Name | Description | Self-Host | Models | Serverless |
| --- | --- | --- | --- | --- |
| Anyscale | Anyscale Endpoints is a fast and scalable API to integrate OSS LLMs into your app. | Yes | Llama | Yes |
| AWS Bedrock | The easiest way to build and scale generative AI applications with foundation models | Yes | Llama, Anthropic, AI21 and Amazon Titan | Yes |
| AWS Sagemaker | Build, train, and deploy machine learning (ML) models on AWS | Yes | Everything | Yes |
| Banana | Serverless GPUs for AI | No | Everything | Yes |
| Beam.cloud | Train and deploy AI and LLM applications securely on serverless GPUs | No | Everything | Yes |
| Cerebrium | Serverless infrastructure for ML: train, deploy and monitor | No | Everything | Yes |
| Haven | Train and Deploy Open Source AI | No | Everything | No |
| Hugging Face | Platform to host models and apps | Yes | Everything | No |
| Inferless | Serverless GPUs to scale your machine learning inference without any hassle | No | Everything | Yes |
| Modal | Cloud functions re-imagined | No | Everything | Yes |
| Mystic | Serverless GPU inference for ML models | No | Everything | Yes |
| Replicate | Run models in the cloud at scale | No | Everything | No |
| Scale AI | Run inference and fine-tuning on Scale's infrastructure | Yes | Llama, Falcon, MPT | No |
| PyQ | Build and Deploy Customized Task Specific AI with Ease | No | Everything | No |
| Perplexity | Generates a model's response for the given chat conversation. | No | Everything | No |