tieandrews committed on
Commit
ad091d5
·
1 Parent(s): 994b162

Update README.md

Files changed (1)
  1. README.md +65 -23
README.md CHANGED
@@ -1,13 +1,17 @@
1
  ---
2
  tags:
3
  - Beta
4
- license: "mit"
5
- thumbnail: "https://huggingface.co/finding-fossils/metaextractor/resolve/main/ffossils-logo-text.png"
 
6
  widget:
7
- - text: "The core sample was aged at 12300 - 13500 BP and found at 210m a.s.l."
8
- example_title: "Age/Alti"
9
- - text: "In Northern Canada, the BGC site core was primarily made up of Pinus pollen."
10
- example_title: "Taxa/Site/Region"
 
 
 
11
  ---
12
 
13
  <img src="https://huggingface.co/finding-fossils/metaextractor/resolve/main/ffossils-logo-text.png" width="400">
@@ -52,23 +56,13 @@ The entities detected by this model are:
52
 
53
  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
54
 
55
- ### Direct Use
56
-
57
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
58
-
59
- [More Information Needed]
60
-
61
- ### Downstream Use [optional]
62
-
63
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
64
-
65
- [More Information Needed]
66
 
67
- ### Out-of-Scope Use
68
 
69
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
70
 
71
- [More Information Needed]
72
 
73
  ## Bias, Risks, and Limitations
74
 
@@ -86,7 +80,41 @@ Users (both direct and downstream) should be made aware of the risks, biases and
86
 
87
  Use the code below to get started with the model.
88
 
89
- [More Information Needed]
90
 
91
  ## Training Details
92
 
@@ -94,7 +122,21 @@ Use the code below to get started with the model.
94
 
95
  <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
96
 
97
- [More Information Needed]
98
 
99
  ### Training Procedure
100
 
@@ -211,4 +253,4 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
211
 
212
  ## Model Card Contact
213
 
214
- [More Information Needed]
 
1
  ---
2
  tags:
3
  - Beta
4
+ license: mit
5
+ thumbnail: >-
6
+ https://huggingface.co/finding-fossils/metaextractor/resolve/main/ffossils-logo-text.png
7
  widget:
8
+ - text: The core sample was aged at 12300 - 13500 BP and found at 210m a.s.l.
9
+ example_title: Age/Alti
10
+ - text: In Northern Canada, the BGC site core was primarily made up of Pinus pollen.
11
+ example_title: Taxa/Site/Region
12
+ metrics:
13
+ - precision
14
+ - recall
15
  ---
16
 
17
  <img src="https://huggingface.co/finding-fossils/metaextractor/resolve/main/ffossils-logo-text.png" width="400">
 
56
 
57
  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
58
 
59
+ This model can be used to extract entities from any text that is related or tangential to paleoecology. Potential uses include identifying unique SITE names in research papers from other domains.
60
 
61
+ ### Direct Use
62
 
63
+ This model is deployed on the xDD (formerly GeoDeepDive) servers, where it is fed new research articles relevant to Neotoma and returns the extracted data.
64
 
65
+ This approach could be adapted to other research domains by using the training and development code found at [github.com/NeotomaDB/MetaExtractor](https://github.com/NeotomaDB/MetaExtractor) to run similar data extraction.
66
 
67
  ## Bias, Risks, and Limitations
68
 
 
80
 
81
  Use the code below to get started with the model.
82
 
+ ```python
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+ from transformers import pipeline
+
+ tokenizer = AutoTokenizer.from_pretrained("finding-fossils/metaextractor")
+ model = AutoModelForTokenClassification.from_pretrained("finding-fossils/metaextractor")
+ ner_pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
+
+ ner_pipe("In Northern Canada, the BGC site core was primarily made up of Pinus pollen.")
+
+ # Output
+ [
+     {
+         "entity_group": "REGION",
+         "score": 0.8088379502296448,
+         "word": " Northern Canada,",
+         "start": 3,
+         "end": 19
+     },
+     {
+         "entity_group": "SITE",
+         "score": 0.8307041525840759,
+         "word": " BGC",
+         "start": 24,
+         "end": 27
+     },
+     {
+         "entity_group": "TAXA",
+         "score": 0.9806344509124756,
+         "word": " Pinus",
+         "start": 63,
+         "end": 68
+     }
+ ]
+ ```
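The `word` values in the example output carry a leading space and, for the REGION entity, a trailing comma. The following is a minimal sketch of tidying those strings and grouping them by label. It assumes the output shape shown in the README example; `group_entities` and `min_score` are hypothetical names chosen for illustration and are not part of `transformers` or MetaExtractor.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.0):
    """Group aggregated NER output by label, cleaning each surface form."""
    grouped = defaultdict(list)
    for ent in entities:
        # Skip low-confidence predictions (threshold is an illustrative choice).
        if ent["score"] < min_score:
            continue
        # Drop the leading space and any trailing list punctuation.
        word = ent["word"].strip().rstrip(",;:")
        grouped[ent["entity_group"]].append(word)
    return dict(grouped)


# Sample copied from the README example output, so no model download is needed.
sample_output = [
    {"entity_group": "REGION", "score": 0.8088379502296448,
     "word": " Northern Canada,", "start": 3, "end": 19},
    {"entity_group": "SITE", "score": 0.8307041525840759,
     "word": " BGC", "start": 24, "end": 27},
    {"entity_group": "TAXA", "score": 0.9806344509124756,
     "word": " Pinus", "start": 63, "end": 68},
]

print(group_entities(sample_output))
# {'REGION': ['Northern Canada'], 'SITE': ['BGC'], 'TAXA': ['Pinus']}
```

Only `,`, `;` and `:` are stripped from the right so that abbreviations ending in a period, such as `a.s.l.` in altitude entities, are left intact.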
118
 
119
  ## Training Details
120
 
 
122
 
123
  <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
124
 
125
+ The model was trained on a set of 39 research articles deemed relevant to the Neotoma Database. All articles were written in English. The project team labeled the entities, using pre-labelling from early models to speed up the process.
126
+
127
+ A 70/15/15 train/validation/test split was used, with the following breakdown of words and entities.
128
+
129
+ | | Train | Validation | Test |
+ |---|:---:|:---:|:---:|
+ | Articles | 28 | 6 | 6 |
+ | Words | 220857 | 37809 | 36098 |
+ | TAXA Entities | 3352 | 650 | 570 |
+ | SITE Entities | 1228 | 177 | 219 |
+ | REGION Entities | 2314 | 318 | 258 |
+ | GEOG Entities | 188 | 37 | 8 |
+ | AGE Entities | 919 | 206 | 153 |
+ | ALTI Entities | 99 | 24 | 14 |
+ | EMAIL Entities | 14 | 4 | 11 |
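The 70/15/15 ratio appears to hold at the article level, while word and entity counts can deviate because articles differ in length and content. The sketch below recomputes the shares for a few rows of the table; the `split_shares` helper is a hypothetical name, and the counts are copied from the table as given (its article counts sum to 40, whereas the text states 39).

```python
def split_shares(train, val, test):
    """Return each split's share of the row total, as percentages."""
    total = train + val + test
    return tuple(round(100 * n / total, 1) for n in (train, val, test))


# (train, validation, test) counts copied from the table above.
counts = {
    "Articles": (28, 6, 6),
    "Words": (220857, 37809, 36098),
    "TAXA Entities": (3352, 650, 570),
    "GEOG Entities": (188, 37, 8),
}

for name, row in counts.items():
    print(f"{name}: {split_shares(*row)}")
```

Rows with few examples, such as GEOG, are the ones most likely to drift from the nominal ratio, which is worth keeping in mind when reading per-entity test metrics.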
140
 
141
  ### Training Procedure
142
 
 
253
 
254
  ## Model Card Contact
255
 
256
+ [More Information Needed]