# Usage
This guide covers setting up the FastMLX server and using it, including making API calls and managing models.
## 1. Installation
Follow the installation guide to install FastMLX.
## 2. Running the Server

Start the FastMLX server with the following command:

```bash
fastmlx
```

or using uvicorn directly:

```bash
uvicorn fastmlx:app --reload --workers 0
```

> [!WARNING]
> The `--reload` flag should not be used in production. It is intended for development purposes only.
### Running with Multiple Workers (Parallel Processing)
For improved performance and parallel processing capabilities, you can specify either the absolute number of worker processes or the fraction of CPU cores to use.
You can set the number of workers in three ways (listed in order of precedence):
1. Command-line argument:

   ```bash
   fastmlx --workers 4
   ```

   or

   ```bash
   uvicorn fastmlx:app --workers 4
   ```

2. Environment variable:

   ```bash
   export FASTMLX_NUM_WORKERS=4
   fastmlx
   ```

3. Default value (2 workers), used when neither of the above is set.

To use all available CPU cores, set the value to `1.0`:

```bash
fastmlx --workers 1.0
```
> [!NOTE]
>
> - The `--reload` flag is not compatible with multiple workers.
> - For optimal performance, the number of workers should typically not exceed the number of CPU cores available on your machine.
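To make the precedence and the fraction rule concrete, here is a minimal sketch of how such a worker count could be resolved. This is an illustration with hypothetical names, not FastMLX's actual internals:

```python
import multiprocessing
import os

# Hypothetical sketch of the precedence rules described above: a CLI value
# wins over FASTMLX_NUM_WORKERS, which wins over the default of 2 workers.
# Values <= 1.0 are read as a fraction of the available CPU cores.
def resolve_workers(cli_value=None, default=2):
    raw = cli_value if cli_value is not None else os.environ.get("FASTMLX_NUM_WORKERS", default)
    value = float(raw)
    if value <= 1.0:
        return max(1, int(multiprocessing.cpu_count() * value))
    return int(value)

print(resolve_workers())        # neither set -> default of 2 workers
print(resolve_workers("1.0"))   # fraction -> all available CPU cores
print(resolve_workers(4))       # explicit count -> 4 workers
```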
#### Considerations for Multi-Worker Setup
- **Stateless Application**: Ensure your FastMLX application is stateless, as each worker process operates independently (see the sketch after this list).
- **Database Connections**: If your app uses a database, make sure your connection pooling is configured to handle multiple workers.
- **Resource Usage**: Monitor your system's resource usage to find the optimal number of workers for your specific hardware and application needs.
- **Load Balancing**: When running with multiple workers, incoming requests are automatically load-balanced across the worker processes.
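As a concrete illustration of the statelessness point, consider this generic Python sketch (not FastMLX code):

```python
# Each uvicorn worker is a separate OS process with its own copy of
# module-level state, so a counter like this only sees the requests that
# happened to be routed to its own worker.
request_count = 0

def handle_request() -> int:
    global request_count
    request_count += 1  # invisible to the other worker processes
    return request_count

# With 4 workers there are 4 independent counters. Anything that must be
# shared across requests belongs in an external store (e.g. Redis or a DB).
```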
## 3. Making API Calls

The API follows the same format as OpenAI's chat completions:
### Vision Language Model

#### Without Streaming

Here's an example of how to use a Vision Language Model:
```python
import requests
import json

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "mlx-community/nanoLLaVA-1.5-4bit",
    "image": "http://images.cocodataset.org/val2017/000000039769.jpg",
    "messages": [{"role": "user", "content": "What are these?"}],
    "max_tokens": 100
}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json())
```
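Since the response follows the OpenAI chat completions layout (the streaming example below reads `choices[0]['delta']['content']` from each chunk), the generated text can be pulled out of the non-streaming response as sketched here. The exact field layout is an assumption based on that format:

```python
# Continuing from the example above: extract just the generated text.
# Field layout assumed from the OpenAI chat completions response format.
result = response.json()
print(result["choices"][0]["message"]["content"])
```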
#### With Streaming
```python
import requests
import json

def process_sse_stream(url, headers, data):
    response = requests.post(url, headers=headers, json=data, stream=True)

    if response.status_code != 200:
        print(f"Error: Received status code {response.status_code}")
        print(response.text)
        return

    full_content = ""

    try:
        for line in response.iter_lines():
            if line:
                line = line.decode('utf-8')
                if line.startswith('data: '):
                    event_data = line[6:]  # Remove 'data: ' prefix
                    if event_data == '[DONE]':
                        print("\nStream finished. ✅")
                        break
                    try:
                        chunk_data = json.loads(event_data)
                        content = chunk_data['choices'][0]['delta']['content']
                        full_content += content
                        print(content, end='', flush=True)
                    except json.JSONDecodeError:
                        print(f"\nFailed to decode JSON: {event_data}")
                    except KeyError:
                        print(f"\nUnexpected data structure: {chunk_data}")
    except KeyboardInterrupt:
        print("\nStream interrupted by user.")
    except requests.exceptions.RequestException as e:
        print(f"\nAn error occurred: {e}")

if __name__ == "__main__":
    url = "http://localhost:8000/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    data = {
        "model": "mlx-community/nanoLLaVA-1.5-4bit",
        "image": "http://images.cocodataset.org/val2017/000000039769.jpg",
        "messages": [{"role": "user", "content": "What are these?"}],
        "max_tokens": 500,
        "stream": True
    }
    process_sse_stream(url, headers, data)
```
### Language Model

#### Without Streaming

Here's an example of how to use a Language Model:
```python
import requests
import json

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "mlx-community/gemma-2-9b-it-4bit",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 100
}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json())
```
#### With Streaming
```python
import requests
import json

def process_sse_stream(url, headers, data):
    response = requests.post(url, headers=headers, json=data, stream=True)

    if response.status_code != 200:
        print(f"Error: Received status code {response.status_code}")
        print(response.text)
        return

    full_content = ""

    try:
        for line in response.iter_lines():
            if line:
                line = line.decode('utf-8')
                if line.startswith('data: '):
                    event_data = line[6:]  # Remove 'data: ' prefix
                    if event_data == '[DONE]':
                        print("\nStream finished. ✅")
                        break
                    try:
                        chunk_data = json.loads(event_data)
                        content = chunk_data['choices'][0]['delta']['content']
                        full_content += content
                        print(content, end='', flush=True)
                    except json.JSONDecodeError:
                        print(f"\nFailed to decode JSON: {event_data}")
                    except KeyError:
                        print(f"\nUnexpected data structure: {chunk_data}")
    except KeyboardInterrupt:
        print("\nStream interrupted by user.")
    except requests.exceptions.RequestException as e:
        print(f"\nAn error occurred: {e}")

if __name__ == "__main__":
    url = "http://localhost:8000/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    data = {
        "model": "mlx-community/gemma-2-9b-it-4bit",
        "messages": [{"role": "user", "content": "Hi, how are you?"}],
        "max_tokens": 500,
        "stream": True
    }
    process_sse_stream(url, headers, data)
```
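Because the endpoint mirrors OpenAI's chat completions API, the official `openai` Python client can likely be pointed at the local server as well. This is a sketch under that assumption; in particular, treating the API key as a placeholder the server ignores is my assumption, not something stated above:

```python
from openai import OpenAI

# Point the official OpenAI client at the local FastMLX server from step 2.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # placeholder; assumed to be ignored by the server
)

response = client.chat.completions.create(
    model="mlx-community/gemma-2-9b-it-4bit",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```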
For more detailed API documentation, please refer to the API Reference section.