Jordi Catafal

cleaning + readme

023e423 25 days ago


	---
	title: Spanish Embeddings Api
	emoji: 🐨
	colorFrom: green
	colorTo: green
	sdk: docker
	pinned: false
	---

	# Multilingual & Legal Embeddings API

	A high-performance FastAPI application providing access to 5 specialized embedding models for Spanish, Catalan, English, and multilingual text. Each model has its own dedicated endpoint for optimal performance and clarity.

	🌐 Live API: [https://aurasystems-spanish-embeddings-api.hf.space](https://aurasystems-spanish-embeddings-api.hf.space)
	📖 Interactive Docs: [https://aurasystems-spanish-embeddings-api.hf.space/docs](https://aurasystems-spanish-embeddings-api.hf.space/docs)

	## 🚀 Quick Start

	### Basic Usage
	```bash
	# Test jina-v3 endpoint (multilingual, loads at startup)
	curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/jina-v3" \
	-H "Content-Type: application/json" \
	-d '{"texts": ["Hello world", "Hola mundo"], "normalize": true}'

	# Test Catalan RoBERTa endpoint
	curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/roberta-ca" \
	-H "Content-Type: application/json" \
	-d '{"texts": ["Bon dia", "Com estàs?"], "normalize": true}'
	```

	## 📚 Available Models & Endpoints

	| Endpoint | Model | Languages | Dimensions | Max Tokens | Loading Strategy |
	|----------|-------|-----------|------------|------------|------------------|
	| `/embed/jina-v3` | jinaai/jina-embeddings-v3 | Multilingual (30+) | 1024 | 8192 | Startup |
	| `/embed/roberta-ca` | projecte-aina/roberta-large-ca-v2 | Catalan | 1024 | 512 | On-demand |
	| `/embed/jina` | jinaai/jina-embeddings-v2-base-es | Spanish, English | 768 | 8192 | On-demand |
	| `/embed/robertalex` | PlanTL-GOB-ES/RoBERTalex | Spanish Legal | 768 | 512 | On-demand |
	| `/embed/legal-bert` | nlpaueb/legal-bert-base-uncased | English Legal | 768 | 512 | On-demand |
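
Because the models produce vectors of different sizes, embeddings from different endpoints cannot share one vector index. The following is a minimal sketch of keeping the table's values in code so a client can check this up front. `EndpointSpec`, `ENDPOINTS`, and `endpoints_with_dimension` are hypothetical names for illustration and are not part of the API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EndpointSpec:
    """Client-side copy of one row of the models table (hypothetical helper)."""
    path: str
    model: str
    dimensions: int
    max_tokens: int
    loads_at_startup: bool


# Values copied from the table above.
ENDPOINTS: Dict[str, EndpointSpec] = {
    "jina-v3": EndpointSpec("/embed/jina-v3", "jinaai/jina-embeddings-v3", 1024, 8192, True),
    "roberta-ca": EndpointSpec("/embed/roberta-ca", "projecte-aina/roberta-large-ca-v2", 1024, 512, False),
    "jina": EndpointSpec("/embed/jina", "jinaai/jina-embeddings-v2-base-es", 768, 8192, False),
    "robertalex": EndpointSpec("/embed/robertalex", "PlanTL-GOB-ES/RoBERTalex", 768, 512, False),
    "legal-bert": EndpointSpec("/embed/legal-bert", "nlpaueb/legal-bert-base-uncased", 768, 512, False),
}


def endpoints_with_dimension(dim: int) -> List[str]:
    """Return the endpoint names whose vectors have the given size, sorted by name."""
    return sorted(name for name, spec in ENDPOINTS.items() if spec.dimensions == dim)


print(endpoints_with_dimension(1024))
print(endpoints_with_dimension(768))
```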

	### Model Recommendations

	- 🌍 General multilingual: Use `/embed/jina-v3` - Best overall performance
	- 🇪🇸 Spanish general: Use `/embed/jina` - Excellent for Spanish/English
	- 🇪🇸 Spanish legal: Use `/embed/robertalex` - Specialized for legal texts
	- 🏴󠁧󠁢󠁣󠁡󠁴󠁿 Catalan: Use `/embed/roberta-ca` - Best for Catalan text
	- 🇬🇧 English legal: Use `/embed/legal-bert` - Specialized for legal documents

	## 🔗 API Endpoints

	### Model-Specific Embedding Endpoints

	Each model has its dedicated endpoint:

	```
	POST /embed/jina-v3 # Multilingual (startup model)
	POST /embed/roberta-ca # Catalan
	POST /embed/jina # Spanish/English
	POST /embed/robertalex # Spanish Legal
	POST /embed/legal-bert # English Legal
	```

	### Utility Endpoints

	```
	GET / # API information
	GET /health # Health check and model status
	GET /models # List all models with specifications
	```

	## 📖 Usage Examples

	### Python

	```python
	import requests

	API_URL = "https://aurasystems-spanish-embeddings-api.hf.space"

	# Example 1: Multilingual with Jina v3 (startup model - fastest)
	response = requests.post(
	    f"{API_URL}/embed/jina-v3",
	    json={
	        "texts": [
	            "Hello world",     # English
	            "Hola mundo",      # Spanish
	            "Bonjour monde",   # French
	            "こんにちは世界"    # Japanese
	        ],
	        "normalize": True
	    }
	)
	result = response.json()
	print(f"Jina v3: {result['dimensions']} dimensions") # 1024

	# Example 2: Catalan text with RoBERTa-ca
	response = requests.post(
	    f"{API_URL}/embed/roberta-ca",
	    json={
	        "texts": [
	            "Bon dia, com estàs?",
	            "Barcelona és una ciutat meravellosa",
	            "M'agrada la cultura catalana"
	        ],
	        "normalize": True
	    }
	)
	catalan_result = response.json()
	print(f"Catalan: {catalan_result['dimensions']} dimensions") # 1024

	# Example 3: Spanish legal text with RoBERTalex
	response = requests.post(
	    f"{API_URL}/embed/robertalex",
	    json={
	        "texts": [
	            "Artículo primero de la constitución",
	            "El contrato será válido desde la fecha de firma",
	            "La jurisprudencia establece que..."
	        ],
	        "normalize": True
	    }
	)
	legal_result = response.json()
	print(f"Spanish Legal: {legal_result['dimensions']} dimensions") # 768

	# Example 4: English legal text with Legal-BERT
	response = requests.post(
	    f"{API_URL}/embed/legal-bert",
	    json={
	        "texts": [
	            "This agreement is legally binding",
	            "The contract shall be governed by English law",
	            "The party hereby agrees and covenants"
	        ],
	        "normalize": True
	    }
	)
	english_legal_result = response.json()
	print(f"English Legal: {english_legal_result['dimensions']} dimensions") # 768

	# Example 5: Spanish/English bilingual with Jina v2
	response = requests.post(
	    f"{API_URL}/embed/jina",
	    json={
	        "texts": [
	            "Inteligencia artificial y machine learning",
	            "Artificial intelligence and machine learning",
	            "Procesamiento de lenguaje natural"
	        ],
	        "normalize": True
	    }
	)
	bilingual_result = response.json()
	print(f"Bilingual: {bilingual_result['dimensions']} dimensions") # 768
	```

	### JavaScript/Node.js

	```javascript
	const API_URL = 'https://aurasystems-spanish-embeddings-api.hf.space';

	// Function to get embeddings from specific endpoint
	async function getEmbeddings(endpoint, texts) {
	  const response = await fetch(`${API_URL}/embed/${endpoint}`, {
	    method: 'POST',
	    headers: {
	      'Content-Type': 'application/json',
	    },
	    body: JSON.stringify({
	      texts: texts,
	      normalize: true
	    })
	  });

	  if (!response.ok) {
	    throw new Error(`Error: ${response.status}`);
	  }

	  return await response.json();
	}

	// Usage examples (top-level await requires an ES module, e.g. a .mjs file)
	try {
	  // Multilingual embeddings
	  const multilingualResult = await getEmbeddings('jina-v3', [
	    'Hello world',
	    'Hola mundo',
	    'Ciao mondo'
	  ]);
	  console.log('Multilingual dimensions:', multilingualResult.dimensions);

	  // Catalan embeddings
	  const catalanResult = await getEmbeddings('roberta-ca', [
	    'Bon dia',
	    'Com estàs?'
	  ]);
	  console.log('Catalan dimensions:', catalanResult.dimensions);

	} catch (error) {
	  console.error('Error:', error);
	}
	```

	### cURL Examples

	```bash
	# Multilingual with Jina v3 (startup model)
	curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/jina-v3" \
	-H "Content-Type: application/json" \
	-d '{
	"texts": ["Hello", "Hola", "Bonjour"],
	"normalize": true
	}'

	# Catalan with RoBERTa-ca
	curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/roberta-ca" \
	-H "Content-Type: application/json" \
	-d '{
	"texts": ["Bon dia", "Com estàs?"],
	"normalize": true
	}'

	# Spanish legal with RoBERTalex
	curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/robertalex" \
	-H "Content-Type: application/json" \
	-d '{
	"texts": ["Artículo primero"],
	"normalize": true
	}'

	# English legal with Legal-BERT
	curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/legal-bert" \
	-H "Content-Type: application/json" \
	-d '{
	"texts": ["This agreement is binding"],
	"normalize": true
	}'

	# Spanish/English bilingual with Jina v2
	curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/jina" \
	-H "Content-Type: application/json" \
	-d '{
	"texts": ["Texto en español", "Text in English"],
	"normalize": true
	}'
	```

	## 📋 Request/Response Schema

	### Request Body

	```json
	{
	  "texts": ["text1", "text2", "..."],
	  "normalize": true,
	  "max_length": null
	}
	```

	| Field | Type | Required | Default | Description |
	|-------|------|----------|---------|-------------|
	| `texts` | array[string] | ✅ Yes | - | 1-50 texts to embed |
	| `normalize` | boolean | No | `true` | L2-normalize embeddings |
	| `max_length` | integer/null | No | `null` | Max tokens (model-specific limits) |
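
To make the `normalize` flag concrete, here is a minimal sketch of what L2 normalization conventionally means: each vector is divided by its Euclidean length, so the dot product of two normalized vectors equals their cosine similarity. That the server computes exactly this is an assumption. `l2_normalize` and `dot` are illustrative helpers, and the 2-dimensional vectors are toy data, not real embeddings.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

print(dot(a, a))  # approximately 1.0: a unit vector compared with itself
print(dot(a, b))  # approximately 0.96: cosine similarity of the two toy vectors
```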

	### Response Body

	```json
	{
	  "embeddings": [[0.123, -0.456, ...], [0.789, -0.012, ...]],
	  "model_used": "jina-v3",
	  "dimensions": 1024,
	  "num_texts": 2
	}
	```
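
A client may want to confirm that a response is internally consistent before using it. The sketch below checks that `num_texts` matches the number of vectors and that every vector has `dimensions` entries. `check_response` is a hypothetical helper, and the sample payload uses made-up 3-dimensional vectors purely for illustration.

```python
from typing import Any, Dict


def check_response(payload: Dict[str, Any], expected_count: int) -> None:
    """Raise ValueError if the response fields disagree with each other."""
    embeddings = payload["embeddings"]
    if payload["num_texts"] != expected_count or len(embeddings) != expected_count:
        raise ValueError(
            f"expected {expected_count} embeddings, got {len(embeddings)} "
            f"(num_texts={payload['num_texts']})"
        )
    for i, vector in enumerate(embeddings):
        if len(vector) != payload["dimensions"]:
            raise ValueError(
                f"embedding {i} has {len(vector)} values, expected {payload['dimensions']}"
            )


# Toy payload shaped like the response body above (not real model output).
sample = {
    "embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "model_used": "jina-v3",
    "dimensions": 3,
    "num_texts": 2,
}
check_response(sample, expected_count=2)  # passes silently
```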

	## ⚡ Performance & Limits

	- Maximum texts per request: 50
	- Startup model: `jina-v3` loads at startup (fastest response)
	- On-demand models: Load on first request (~30-60s first time)
	- Typical response time: 100-300ms after models are loaded
	- Memory optimization: Automatic cleanup for large batches
	- CORS enabled: Works from any domain
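
Given the 50-text limit above, larger collections have to be split across several requests. A minimal sketch of such a splitter follows; `batched` and `MAX_TEXTS_PER_REQUEST` are hypothetical names, and sending each batch to the API is left to the caller.

```python
from typing import Iterator, List, Sequence

MAX_TEXTS_PER_REQUEST = 50  # limit stated in the list above


def batched(texts: Sequence[str], size: int = MAX_TEXTS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each with at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(texts), size):
        yield list(texts[start:start + size])


corpus = [f"document {i}" for i in range(120)]
batches = list(batched(corpus))
print([len(batch) for batch in batches])  # [50, 50, 20]
```

Each batch can then be posted to one of the `/embed/...` endpoints and the returned `embeddings` lists concatenated in order.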

	## 🔧 Advanced Usage

	### LangChain Integration

	```python
	from langchain.embeddings.base import Embeddings
	from typing import List
	import requests

	class MultilingualEmbeddings(Embeddings):
	    """LangChain integration for multilingual embeddings"""

	    def __init__(self, endpoint: str = "jina-v3"):
	        """
	        Initialize with specific endpoint

	        Args:
	            endpoint: One of "jina-v3", "roberta-ca", "jina", "robertalex", "legal-bert"
	        """
	        self.api_url = f"https://aurasystems-spanish-embeddings-api.hf.space/embed/{endpoint}"
	        self.endpoint = endpoint

	    def embed_documents(self, texts: List[str]) -> List[List[float]]:
	        response = requests.post(
	            self.api_url,
	            json={"texts": texts, "normalize": True}
	        )
	        response.raise_for_status()
	        return response.json()["embeddings"]

	    def embed_query(self, text: str) -> List[float]:
	        return self.embed_documents([text])[0]

	# Usage examples
	multilingual_embeddings = MultilingualEmbeddings("jina-v3")
	catalan_embeddings = MultilingualEmbeddings("roberta-ca")
	spanish_legal_embeddings = MultilingualEmbeddings("robertalex")
	```

	### Semantic Search

	```python
	import numpy as np
	import requests
	from typing import List, Tuple

	def semantic_search(query: str, documents: List[str], endpoint: str = "jina-v3", top_k: int = 5) -> List[Tuple[int, float]]:
	    """Semantic search using specific model endpoint"""

	    # Query and documents share one request, so together they must stay within the 50-text limit
	    response = requests.post(
	        f"https://aurasystems-spanish-embeddings-api.hf.space/embed/{endpoint}",
	        json={"texts": [query] + documents, "normalize": True}
	    )
	    response.raise_for_status()

	    embeddings = np.array(response.json()["embeddings"])
	    query_embedding = embeddings[0]
	    doc_embeddings = embeddings[1:]

	    # Calculate cosine similarities (already normalized)
	    similarities = np.dot(doc_embeddings, query_embedding)
	    top_indices = np.argsort(similarities)[::-1][:top_k]

	    return [(int(idx), float(similarities[idx])) for idx in top_indices]

	# Example: Multilingual search
	documents = [
	    "Python programming language",
	    "Lenguaje de programación Python",
	    "Llenguatge de programació Python",
	    "Langage de programmation Python"
	]

	results = semantic_search("código en Python", documents, "jina-v3")
	for idx, score in results:
	    print(f"{score:.4f}: {documents[idx]}")
	```

	## 🚨 Error Handling

	### HTTP Status Codes

	| Code | Description |
	|------|-------------|
	| 200 | Success |
	| 400 | Bad Request (validation error) |
	| 422 | Unprocessable Entity (schema error) |
	| 500 | Internal Server Error (model loading failed) |
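
Many 400 and 422 responses can be avoided by checking the payload locally before sending it. The sketch below mirrors the request schema documented earlier (1-50 string texts, boolean `normalize`, integer or null `max_length`). `validate_payload` is a hypothetical client-side helper, and the server's own validation rules may be stricter than this.

```python
from typing import Any, Dict, List

MAX_TEXTS = 50  # "1-50 texts to embed" per the request schema


def validate_payload(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the payload looks valid."""
    problems: List[str] = []

    texts = payload.get("texts")
    if not isinstance(texts, list) or not all(isinstance(t, str) for t in texts):
        problems.append("'texts' must be a list of strings")
    elif not 1 <= len(texts) <= MAX_TEXTS:
        problems.append(f"'texts' must contain 1-{MAX_TEXTS} items, got {len(texts)}")

    if not isinstance(payload.get("normalize", True), bool):
        problems.append("'normalize' must be a boolean")

    max_length = payload.get("max_length")
    # bool is a subclass of int in Python, so it is rejected explicitly
    if max_length is not None and (isinstance(max_length, bool) or not isinstance(max_length, int)):
        problems.append("'max_length' must be an integer or null")

    return problems


print(validate_payload({"texts": ["Hola mundo"], "normalize": True}))  # []
print(validate_payload({"texts": []}))
```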

	### Common Errors

	```python
	import requests

	# Handle errors properly
	try:
	    response = requests.post(
	        "https://aurasystems-spanish-embeddings-api.hf.space/embed/jina-v3",
	        json={"texts": ["text"], "normalize": True}
	    )
	    response.raise_for_status()
	    result = response.json()
	except requests.exceptions.HTTPError as e:
	    # The server answered with a 4xx/5xx status; the body usually explains why
	    print(f"HTTP error: {e}")
	    print(f"Response: {e.response.text}")
	except requests.exceptions.RequestException as e:
	    # Connection problems, timeouts, and other transport-level failures
	    print(f"Request error: {e}")
	```

	## 📊 Model Status Check

	```python
	import requests

	# Check which models are loaded
	health = requests.get("https://aurasystems-spanish-embeddings-api.hf.space/health")
	health.raise_for_status()
	status = health.json()

	print(f"API Status: {status['status']}")
	print(f"Startup model loaded: {status['startup_model_loaded']}")
	print(f"Available models: {status['available_models']}")
	print(f"Models loaded: {status['models_count']}/5")

	# Check endpoint status
	for model, endpoint_status in status['endpoints'].items():
	    print(f"{model}: {endpoint_status}")
	```

	## 🔒 Authentication & Rate Limits

	- Authentication: None required (open API)
	- Rate limits: Generous limits on Hugging Face Spaces
	- CORS: Enabled for all origins
	- Usage: Free for research and commercial use

	## 🏗️ Architecture

	### Endpoint-Per-Model Design
	- Startup model: `jina-v3` loads at application startup for fastest response
	- On-demand loading: Other models load when first requested
	- Memory optimization: Progressive loading reduces startup time
	- Model caching: Once loaded, models remain in memory for fast inference
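
The loading strategy described above can be sketched as a small registry that loads one model eagerly and the rest on first use, then keeps them cached. This is an illustrative sketch only, not the Space's actual implementation: `LazyModelRegistry` and `fake_loader` are hypothetical, and the fake loader stands in for real model loading.

```python
import threading
from typing import Callable, Dict, List


class LazyModelRegistry:
    """Load one model at construction time and the others on first request."""

    def __init__(self, loader: Callable[[str], object], startup_model: str):
        self._loader = loader
        self._models: Dict[str, object] = {}
        self._lock = threading.Lock()
        self.get(startup_model)  # eager load, mirroring the "startup model" idea

    def get(self, name: str) -> object:
        """Return the cached model, loading it first if this is the first request."""
        with self._lock:
            if name not in self._models:
                self._models[name] = self._loader(name)
            return self._models[name]

    def loaded(self) -> List[str]:
        """Names of the models currently held in memory, sorted."""
        with self._lock:
            return sorted(self._models)


load_calls: List[str] = []


def fake_loader(name: str) -> str:
    """Stand-in for real model loading; records each call."""
    load_calls.append(name)
    return f"model:{name}"


registry = LazyModelRegistry(fake_loader, startup_model="jina-v3")
registry.get("roberta-ca")  # first request triggers a load
registry.get("roberta-ca")  # second request is served from the cache
print(load_calls)           # ['jina-v3', 'roberta-ca']
print(registry.loaded())    # ['jina-v3', 'roberta-ca']
```

Holding a single lock while loading keeps the sketch simple, at the cost of blocking requests for other models during a load.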

	### Technical Stack
	- FastAPI: Modern async web framework
	- Transformers: Hugging Face model library
	- PyTorch: Deep learning backend
	- Docker: Containerized deployment
	- Hugging Face Spaces: Cloud hosting platform

	## 📄 Model Licenses

	- Jina models: Apache 2.0
	- RoBERTa models: MIT/Apache 2.0
	- Legal-BERT: Apache 2.0

	## 🤝 Support & Contributing

	- Issues: [Hugging Face Discussions](https://huggingface.co/spaces/AuraSystems/spanish-embeddings-api/discussions)
	- Interactive Docs: [FastAPI Swagger UI](https://aurasystems-spanish-embeddings-api.hf.space/docs)
	- Model Papers: Check individual model pages on Hugging Face

	---

	Built with ❤️ using FastAPI and Hugging Face Transformers