---
title: Spanish Embeddings Api
emoji: 🐨
colorFrom: green
colorTo: green
sdk: docker
pinned: false
---
# Multilingual & Legal Embeddings API
A high-performance FastAPI application providing access to **5 specialized embedding models** for Spanish, Catalan, English, and multilingual text. Each model has its own dedicated endpoint for optimal performance and clarity.
🌐 **Live API**: [https://aurasystems-spanish-embeddings-api.hf.space](https://aurasystems-spanish-embeddings-api.hf.space)
📖 **Interactive Docs**: [https://aurasystems-spanish-embeddings-api.hf.space/docs](https://aurasystems-spanish-embeddings-api.hf.space/docs)
## 🚀 Quick Start
### Basic Usage
```bash
# Test jina-v3 endpoint (multilingual, loads at startup)
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/jina-v3" \
-H "Content-Type: application/json" \
-d '{"texts": ["Hello world", "Hola mundo"], "normalize": true}'
# Test Catalan RoBERTa endpoint
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/roberta-ca" \
-H "Content-Type: application/json" \
-d '{"texts": ["Bon dia", "Com estàs?"], "normalize": true}'
```
## 📚 Available Models & Endpoints
| Endpoint | Model | Languages | Dimensions | Max Tokens | Loading Strategy |
|----------|--------|-----------|------------|------------|------------------|
| `/embed/jina-v3` | jinaai/jina-embeddings-v3 | Multilingual (30+) | 1024 | 8192 | **Startup** |
| `/embed/roberta-ca` | projecte-aina/roberta-large-ca-v2 | Catalan | 1024 | 512 | On-demand |
| `/embed/jina` | jinaai/jina-embeddings-v2-base-es | Spanish, English | 768 | 8192 | On-demand |
| `/embed/robertalex` | PlanTL-GOB-ES/RoBERTalex | Spanish Legal | 768 | 512 | On-demand |
| `/embed/legal-bert` | nlpaueb/legal-bert-base-uncased | English Legal | 768 | 512 | On-demand |
### Model Recommendations
- **🌍 General multilingual**: Use `/embed/jina-v3` - Best overall performance
- **🇪🇸 Spanish general**: Use `/embed/jina` - Excellent for Spanish/English
- **🇪🇸 Spanish legal**: Use `/embed/robertalex` - Specialized for legal texts
- **🏴󠁧󠁢󠁣󠁡󠁴󠁿 Catalan**: Use `/embed/roberta-ca` - Best for Catalan text
- **🇬🇧 English legal**: Use `/embed/legal-bert` - Specialized for legal documents
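The recommendations above can be expressed as a small lookup on the client side. This is a minimal sketch, not part of the API: `pick_endpoint`, `ENDPOINT_BY_USE_CASE`, and the language/domain keys are hypothetical names chosen for illustration. Only the endpoint names come from the table.

```python
# Hypothetical client-side helper: maps (language, domain) to an endpoint name.
# Only the endpoint names are taken from this README; the keys are assumptions.
ENDPOINT_BY_USE_CASE = {
    ("es", "legal"): "robertalex",
    ("en", "legal"): "legal-bert",
    ("ca", "general"): "roberta-ca",
    ("es", "general"): "jina",
}


def pick_endpoint(language: str, domain: str = "general") -> str:
    """Return an endpoint name, falling back to the multilingual jina-v3."""
    key = (language.lower(), domain.lower())
    return ENDPOINT_BY_USE_CASE.get(key, "jina-v3")


print(pick_endpoint("ca"))           # roberta-ca
print(pick_endpoint("es", "legal"))  # robertalex
print(pick_endpoint("fr"))           # jina-v3
```

Any pair without a specialized model, such as French or Catalan legal text, falls back to `jina-v3` in this sketch.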
## 🔗 API Endpoints
### Model-Specific Embedding Endpoints
Each model has its own dedicated endpoint:
```
POST /embed/jina-v3 # Multilingual (startup model)
POST /embed/roberta-ca # Catalan
POST /embed/jina # Spanish/English
POST /embed/robertalex # Spanish Legal
POST /embed/legal-bert # English Legal
```
### Utility Endpoints
```
GET / # API information
GET /health # Health check and model status
GET /models # List all models with specifications
```
## 📖 Usage Examples
### Python
```python
import requests

API_URL = "https://aurasystems-spanish-embeddings-api.hf.space"

# Example 1: Multilingual with Jina v3 (startup model - fastest)
response = requests.post(
    f"{API_URL}/embed/jina-v3",
    json={
        "texts": [
            "Hello world",      # English
            "Hola mundo",       # Spanish
            "Bonjour monde",    # French
            "こんにちは世界"      # Japanese
        ],
        "normalize": True
    }
)
result = response.json()
print(f"Jina v3: {result['dimensions']} dimensions")  # 1024

# Example 2: Catalan text with RoBERTa-ca
response = requests.post(
    f"{API_URL}/embed/roberta-ca",
    json={
        "texts": [
            "Bon dia, com estàs?",
            "Barcelona és una ciutat meravellosa",
            "M'agrada la cultura catalana"
        ],
        "normalize": True
    }
)
catalan_result = response.json()
print(f"Catalan: {catalan_result['dimensions']} dimensions")  # 1024

# Example 3: Spanish legal text with RoBERTalex
response = requests.post(
    f"{API_URL}/embed/robertalex",
    json={
        "texts": [
            "Artículo primero de la constitución",
            "El contrato será válido desde la fecha de firma",
            "La jurisprudencia establece que..."
        ],
        "normalize": True
    }
)
legal_result = response.json()
print(f"Spanish Legal: {legal_result['dimensions']} dimensions")  # 768

# Example 4: English legal text with Legal-BERT
response = requests.post(
    f"{API_URL}/embed/legal-bert",
    json={
        "texts": [
            "This agreement is legally binding",
            "The contract shall be governed by English law",
            "The party hereby agrees and covenants"
        ],
        "normalize": True
    }
)
english_legal_result = response.json()
print(f"English Legal: {english_legal_result['dimensions']} dimensions")  # 768

# Example 5: Spanish/English bilingual with Jina v2
response = requests.post(
    f"{API_URL}/embed/jina",
    json={
        "texts": [
            "Inteligencia artificial y machine learning",
            "Artificial intelligence and machine learning",
            "Procesamiento de lenguaje natural"
        ],
        "normalize": True
    }
)
bilingual_result = response.json()
print(f"Bilingual: {bilingual_result['dimensions']} dimensions")  # 768
```
### JavaScript/Node.js
```javascript
const API_URL = 'https://aurasystems-spanish-embeddings-api.hf.space';

// Function to get embeddings from specific endpoint
async function getEmbeddings(endpoint, texts) {
  const response = await fetch(`${API_URL}/embed/${endpoint}`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      texts: texts,
      normalize: true
    })
  });

  if (!response.ok) {
    throw new Error(`Error: ${response.status}`);
  }

  return await response.json();
}

// Usage examples (top-level await requires an ES module)
try {
  // Multilingual embeddings
  const multilingualResult = await getEmbeddings('jina-v3', [
    'Hello world',
    'Hola mundo',
    'Ciao mondo'
  ]);
  console.log('Multilingual dimensions:', multilingualResult.dimensions);

  // Catalan embeddings
  const catalanResult = await getEmbeddings('roberta-ca', [
    'Bon dia',
    'Com estàs?'
  ]);
  console.log('Catalan dimensions:', catalanResult.dimensions);
} catch (error) {
  console.error('Error:', error);
}
```
### cURL Examples
```bash
# Multilingual with Jina v3 (startup model)
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/jina-v3" \
-H "Content-Type: application/json" \
-d '{
"texts": ["Hello", "Hola", "Bonjour"],
"normalize": true
}'
# Catalan with RoBERTa-ca
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/roberta-ca" \
-H "Content-Type: application/json" \
-d '{
"texts": ["Bon dia", "Com estàs?"],
"normalize": true
}'
# Spanish legal with RoBERTalex
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/robertalex" \
-H "Content-Type: application/json" \
-d '{
"texts": ["Artículo primero"],
"normalize": true
}'
# English legal with Legal-BERT
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/legal-bert" \
-H "Content-Type: application/json" \
-d '{
"texts": ["This agreement is binding"],
"normalize": true
}'
# Spanish/English bilingual with Jina v2
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/jina" \
-H "Content-Type: application/json" \
-d '{
"texts": ["Texto en español", "Text in English"],
"normalize": true
}'
```
## 📋 Request/Response Schema
### Request Body
```json
{
"texts": ["text1", "text2", "..."],
"normalize": true,
"max_length": null
}
```
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `texts` | array[string] | ✅ Yes | - | 1-50 texts to embed |
| `normalize` | boolean | No | `true` | L2-normalize embeddings |
| `max_length` | integer/null | No | `null` | Max tokens (model-specific limits) |
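Checking a request against the table above before sending it avoids a round trip that would only return a validation error. The following is a minimal client-side sketch; `build_payload` and `MAX_TEXTS_PER_REQUEST` are hypothetical names, and the server's own validation may differ in detail.

```python
from typing import List, Optional

MAX_TEXTS_PER_REQUEST = 50  # limit documented in the request table


def build_payload(
    texts: List[str],
    normalize: bool = True,
    max_length: Optional[int] = None,
) -> dict:
    """Validate the documented constraints and return a JSON-ready dict."""
    if not 1 <= len(texts) <= MAX_TEXTS_PER_REQUEST:
        raise ValueError(
            f"expected 1-{MAX_TEXTS_PER_REQUEST} texts, got {len(texts)}"
        )
    if not all(isinstance(text, str) for text in texts):
        raise TypeError("every entry in 'texts' must be a string")
    if max_length is not None and max_length <= 0:
        raise ValueError("max_length must be a positive integer or None")
    return {"texts": list(texts), "normalize": normalize, "max_length": max_length}


payload = build_payload(["Hola mundo", "Bon dia"])
print(payload["normalize"], len(payload["texts"]))  # True 2
```

The returned dict can be passed directly as the `json=` argument of `requests.post`.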
### Response Body
```json
{
"embeddings": [[0.123, -0.456, ...], [0.789, -0.012, ...]],
"model_used": "jina-v3",
"dimensions": 1024,
"num_texts": 2
}
```
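A response can be sanity-checked against its own `num_texts` and `dimensions` fields before the vectors are used downstream. The sketch below assumes the response shape shown above; `check_response` and its tolerance are hypothetical, and the unit-norm check only applies when the request set `normalize` to `true`.

```python
import math
from typing import List


def check_response(
    result: dict, expect_normalized: bool = True, tol: float = 1e-3
) -> List[List[float]]:
    """Verify counts, vector length, and (optionally) unit L2 norm."""
    embeddings = result["embeddings"]
    if len(embeddings) != result["num_texts"]:
        raise ValueError("number of embeddings does not match num_texts")
    for vector in embeddings:
        if len(vector) != result["dimensions"]:
            raise ValueError("vector length does not match dimensions")
        norm = math.sqrt(sum(x * x for x in vector))
        if expect_normalized and abs(norm - 1.0) > tol:
            raise ValueError(f"vector is not L2-normalized (norm={norm:.4f})")
    return embeddings


# Tiny made-up response with 2-dimensional vectors, for illustration only.
sample = {
    "embeddings": [[0.6, 0.8], [1.0, 0.0]],
    "model_used": "jina-v3",
    "dimensions": 2,
    "num_texts": 2,
}
print(len(check_response(sample)))  # 2
```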
## ⚡ Performance & Limits
- **Maximum texts per request**: 50
- **Startup model**: `jina-v3` loads at startup (fastest response)
- **On-demand models**: Load on first request (~30-60s first time)
- **Typical response time**: 100-300ms after models are loaded
- **Memory optimization**: Automatic cleanup for large batches
- **CORS enabled**: Works from any domain
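Because a request accepts at most 50 texts, larger collections have to be split on the client. The sketch below shows one way to do that; `chunked` and `embed_all` are hypothetical helpers, and `embed_batch` stands for any function that sends one batch to an endpoint and returns its embeddings.

```python
from typing import Callable, Iterator, List

MAX_TEXTS_PER_REQUEST = 50  # documented per-request limit


def chunked(texts: List[str], size: int = MAX_TEXTS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` texts."""
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


def embed_all(
    texts: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
) -> List[List[float]]:
    """Embed any number of texts by calling `embed_batch` once per chunk."""
    embeddings: List[List[float]] = []
    for batch in chunked(texts):
        embeddings.extend(embed_batch(batch))
    return embeddings


# Stand-in for a real API call so the example runs offline.
def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    return [[float(len(text))] for text in batch]


texts = [f"text {i}" for i in range(120)]
print([len(batch) for batch in chunked(texts)])  # [50, 50, 20]
print(len(embed_all(texts, fake_embed_batch)))   # 120
```

In real use, `embed_batch` could wrap `requests.post(...)` with a generous `timeout` (for example 120 seconds, an assumed value) so that the first request to an on-demand model survives the ~30-60s load.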
## 🔧 Advanced Usage
### LangChain Integration
```python
from langchain.embeddings.base import Embeddings
from typing import List
import requests


class MultilingualEmbeddings(Embeddings):
    """LangChain integration for multilingual embeddings"""

    def __init__(self, endpoint: str = "jina-v3"):
        """
        Initialize with specific endpoint

        Args:
            endpoint: One of "jina-v3", "roberta-ca", "jina", "robertalex", "legal-bert"
        """
        self.api_url = f"https://aurasystems-spanish-embeddings-api.hf.space/embed/{endpoint}"
        self.endpoint = endpoint

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        response = requests.post(
            self.api_url,
            json={"texts": texts, "normalize": True}
        )
        response.raise_for_status()
        return response.json()["embeddings"]

    def embed_query(self, text: str) -> List[float]:
        return self.embed_documents([text])[0]


# Usage examples
multilingual_embeddings = MultilingualEmbeddings("jina-v3")
catalan_embeddings = MultilingualEmbeddings("roberta-ca")
spanish_legal_embeddings = MultilingualEmbeddings("robertalex")
```
### Semantic Search
```python
import numpy as np
import requests
from typing import List, Tuple


def semantic_search(
    query: str, documents: List[str], endpoint: str = "jina-v3", top_k: int = 5
) -> List[Tuple[int, float]]:
    """Semantic search using specific model endpoint"""
    # The query and the documents share one request, so together they
    # must stay within the 50-text limit.
    response = requests.post(
        f"https://aurasystems-spanish-embeddings-api.hf.space/embed/{endpoint}",
        json={"texts": [query] + documents, "normalize": True}
    )
    response.raise_for_status()
    embeddings = np.array(response.json()["embeddings"])

    query_embedding = embeddings[0]
    doc_embeddings = embeddings[1:]

    # Calculate cosine similarities (already normalized)
    similarities = np.dot(doc_embeddings, query_embedding)
    top_indices = np.argsort(similarities)[::-1][:top_k]

    return [(int(idx), float(similarities[idx])) for idx in top_indices]


# Example: Multilingual search
documents = [
    "Python programming language",
    "Lenguaje de programación Python",
    "Llenguatge de programació Python",
    "Langage de programmation Python"
]
results = semantic_search("código en Python", documents, "jina-v3")
for idx, score in results:
    print(f"{score:.4f}: {documents[idx]}")
```
## 🚨 Error Handling
### HTTP Status Codes
| Code | Description |
|------|-------------|
| 200 | Success |
| 400 | Bad Request (validation error) |
| 422 | Unprocessable Entity (schema error) |
| 500 | Internal Server Error (model loading failed) |
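The status codes above suggest different client reactions: 400 and 422 point to a problem in the request and will not succeed on retry, while a 500 during model loading may be transient. The sketch below encodes that policy. It is an assumption about sensible client behaviour, not something the API prescribes, and `call_with_retry`, `send`, and the backoff values are hypothetical.

```python
import time
from typing import Callable, Tuple


def call_with_retry(
    send: Callable[[], Tuple[int, dict]],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call `send` until it returns 200, retrying non-client errors with backoff.

    `send` performs one request and returns (status_code, parsed_body).
    """
    for attempt in range(retries + 1):
        status, body = send()
        if status == 200:
            return body
        if status in (400, 422):
            # Client-side problem: retrying the same payload cannot help.
            raise ValueError(f"request rejected ({status}): {body}")
        if attempt < retries:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError(f"giving up after {retries + 1} attempts (last status {status})")


# Offline demonstration: one simulated 500, then success.
simulated = iter([(500, {"detail": "model loading failed"}), (200, {"num_texts": 1})])
print(call_with_retry(lambda: next(simulated), sleep=lambda seconds: None))
# {'num_texts': 1}
```

With `requests`, `send` could be a small function that posts the payload and returns `(response.status_code, response.json())`.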
### Common Errors
```python
import requests

# Handle errors properly
try:
    response = requests.post(
        "https://aurasystems-spanish-embeddings-api.hf.space/embed/jina-v3",
        json={"texts": ["text"], "normalize": True}
    )
    response.raise_for_status()
    result = response.json()
except requests.exceptions.HTTPError as e:
    # Raised by raise_for_status() for 4xx/5xx responses
    print(f"HTTP error: {e}")
    print(f"Response: {e.response.text}")
except requests.exceptions.RequestException as e:
    # Connection errors, timeouts, and other transport failures
    print(f"Request error: {e}")
```
## 📊 Model Status Check
```python
import requests

# Check which models are loaded
health = requests.get("https://aurasystems-spanish-embeddings-api.hf.space/health")
health.raise_for_status()
status = health.json()

print(f"API Status: {status['status']}")
print(f"Startup model loaded: {status['startup_model_loaded']}")
print(f"Available models: {status['available_models']}")
print(f"Models loaded: {status['models_count']}/5")

# Check endpoint status
for model, endpoint_status in status['endpoints'].items():
    print(f"{model}: {endpoint_status}")
```
## 🔒 Authentication & Rate Limits
- **Authentication**: None required (open API)
- **Rate limits**: Generous limits on Hugging Face Spaces
- **CORS**: Enabled for all origins
- **Usage**: Free for research and commercial use
## 🏗️ Architecture
### Endpoint-Per-Model Design
- **Startup model**: `jina-v3` loads at application startup for fastest response
- **On-demand loading**: Other models load when first requested
- **Memory optimization**: Progressive loading reduces startup time
- **Model caching**: Once loaded, models remain in memory for fast inference
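The loading strategy described above (one model loaded eagerly, the rest on first request, all cached afterwards) can be illustrated with a small registry. This is a sketch of the general pattern under stated assumptions, not the Space's actual implementation: `LazyModelRegistry` and the loader callables are hypothetical, and strings stand in for real models.

```python
import threading
from typing import Any, Callable, Dict, List


class LazyModelRegistry:
    """Load one model at startup and the others on first use, then cache them."""

    def __init__(self, loaders: Dict[str, Callable[[], Any]], startup: str):
        self._loaders = loaders
        self._models: Dict[str, Any] = {}
        self._lock = threading.Lock()
        self.get(startup)  # eager load, like jina-v3 in this README

    def get(self, name: str) -> Any:
        if name not in self._loaders:
            raise KeyError(f"unknown model: {name}")
        # The lock prevents two concurrent first requests from loading twice.
        with self._lock:
            if name not in self._models:
                self._models[name] = self._loaders[name]()
            return self._models[name]

    def loaded(self) -> List[str]:
        return sorted(self._models)


# Stand-in loaders; a real service would load model weights here.
registry = LazyModelRegistry(
    {"jina-v3": lambda: "jina-v3 model", "roberta-ca": lambda: "roberta-ca model"},
    startup="jina-v3",
)
print(registry.loaded())      # ['jina-v3']
registry.get("roberta-ca")    # first request triggers the load
print(registry.loaded())      # ['jina-v3', 'roberta-ca']
```

Holding a single lock during loading keeps the sketch short; a per-model lock would avoid blocking requests for already-loaded models while another model loads.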
### Technical Stack
- **FastAPI**: Modern async web framework
- **Transformers**: Hugging Face model library
- **PyTorch**: Deep learning backend
- **Docker**: Containerized deployment
- **Hugging Face Spaces**: Cloud hosting platform
## 📄 Model Licenses
- **Jina models**: Apache 2.0
- **RoBERTa models**: MIT/Apache 2.0
- **Legal-BERT**: Apache 2.0
## 🤝 Support & Contributing
- **Issues**: [Hugging Face Discussions](https://huggingface.co/spaces/AuraSystems/spanish-embeddings-api/discussions)
- **Interactive Docs**: [FastAPI Swagger UI](https://aurasystems-spanish-embeddings-api.hf.space/docs)
- **Model Papers**: Check individual model pages on Hugging Face
---
Built with ❤️ using **FastAPI** and **Hugging Face Transformers**