---
title: GraphRAG Indexing 🤖
navtitle: Indexing Overview
layout: page
tags: [post]
---

The GraphRAG indexing package is a data pipeline and transformation suite designed to extract meaningful, structured data from unstructured text using LLMs.

Indexing pipelines are configurable. They are composed of workflows, standard and custom steps, prompt templates, and input/output adapters. Our standard pipeline is designed to:

- extract entities, relationships, and claims from raw text
- perform community detection over the extracted entities
- generate community summaries and reports at multiple levels of granularity
- embed entities into a graph vector space
- embed text chunks into a textual vector space

The outputs of the pipeline can be stored in a variety of formats, including JSON and Parquet, or they can be handled manually via the Python API.
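
If you emit Parquet, the resulting tables can be inspected directly with pandas. Below is a minimal sketch; the artifacts path and table file names are assumptions based on a typical default run, and your configuration may produce different names and locations:

```python
# Sketch: inspect emitted Parquet tables with pandas.
# The artifacts path and file names below are assumptions; adjust them to
# match your own output configuration.
from pathlib import Path

import pandas as pd

artifacts = Path("output/<run-id>/artifacts")  # hypothetical output location

entities = pd.read_parquet(artifacts / "create_final_entities.parquet")
relationships = pd.read_parquet(artifacts / "create_final_relationships.parquet")

print(entities.head())
print(relationships.head())
```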

## Getting Started

### Requirements

See the [requirements](/posts/developing#requirements) section in [Get Started](/posts/get_started) for details on setting up a development environment.

The Indexing Engine can be used in either a default configuration mode or with a custom pipeline.
To configure GraphRAG, see the [configuration](/posts/config/overview) documentation.
After you have a config file, you can run the pipeline using the CLI or the Python API.
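
For custom config mode, the pipeline is described in a YAML file such as `your_pipeline.yml`. The sketch below mirrors the `derive` workflow from the Python example later on this page; treat the exact schema as an assumption and consult the [configuration documentation](/posts/config/overview) for the authoritative format:

```yaml
# Sketch of a custom pipeline config (schema assumed; see the configuration docs).
workflows:
  - steps:
      - verb: derive # built-in datashaper verb
        args:
          column1: col1
          column2: col2
          to: col_multiplied # new column name
          operator: "*" # multiply the two columns
```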

## Usage

### CLI

```bash
# Via Poetry
poetry run poe cli --root <data_root> # default config mode
poetry run poe cli --config your_pipeline.yml # custom config mode

# Via Node
yarn run:index --root <data_root> # default config mode
yarn run:index --config your_pipeline.yml # custom config mode
```
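
In default config mode, `--root` should point at a prepared data root. A rough sketch of setting one up, assuming the layout from the Get Started guide (an `input/` folder of `.txt` documents plus `settings.yaml` and `.env` in the root; adjust to your setup):

```bash
# Sketch: prepare a data root for default config mode (layout assumed from Get Started).
mkdir -p ./ragtest/input
cp ./my_documents/*.txt ./ragtest/input/   # hypothetical source documents
# Place settings.yaml and a .env file (with your LLM API key) in ./ragtest,
# then run the indexer against that root:
poetry run poe cli --root ./ragtest
```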

### Python API

```python
import pandas as pd

from graphrag.index import run_pipeline
from graphrag.index.config import PipelineWorkflowReference

workflows: list[PipelineWorkflowReference] = [
    PipelineWorkflowReference(
        steps=[
            {
                # built-in verb
                "verb": "derive",  # https://github.com/microsoft/datashaper/blob/main/python/datashaper/datashaper/engine/verbs/derive.py
                "args": {
                    "column1": "col1",  # a column in the input dataset below
                    "column2": "col2",  # a column in the input dataset below
                    "to": "col_multiplied",  # name of the new derived column
                    "operator": "*",  # multiply the two columns
                },
                # Since we're acting on the default input, we don't need to explicitly specify an input
            }
        ]
    ),
]

dataset = pd.DataFrame([{"col1": 2, "col2": 4}, {"col1": 5, "col2": 10}])
outputs = []
# run_pipeline is asynchronous, so this loop must run inside an async context
async for output in run_pipeline(dataset=dataset, workflows=workflows):
    outputs.append(output)
pipeline_result = outputs[-1]
print(pipeline_result)
```
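
Because `run_pipeline` is asynchronous, the `async for` loop above has to execute inside an async context. One way to drive it from a plain script is a small wrapper like the following; the `main` function is purely illustrative and reuses the `workflows` list defined above:

```python
import asyncio

import pandas as pd

from graphrag.index import run_pipeline


async def main() -> None:
    # Reuses the `workflows` list defined in the snippet above.
    dataset = pd.DataFrame([{"col1": 2, "col2": 4}, {"col1": 5, "col2": 10}])
    results = []
    async for result in run_pipeline(dataset=dataset, workflows=workflows):
        results.append(result)
    print(results[-1])


if __name__ == "__main__":
    asyncio.run(main())
```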

## Further Reading

- To start developing within the _GraphRAG_ project, see [getting started](/posts/developing/)
- To understand the underlying concepts and execution model of the indexing library, see [the architecture documentation](/posts/index/0-architecture/)
- To get running with a series of examples, see [the examples documentation](https://github.com/microsoft/graphrag/blob/main/examples/README.md)
- To read more about configuring the indexing engine, see [the configuration documentation](/posts/config/overview)