Commit 77a6b6d1 authored by Xavier Tannier

initial commit

parent 42a0c05a
This tool was originally created in summer 2016 by Gabriel Bellard and Xavier Tannier, at LIMSI-CNRS, Orsay, France.
Code contributors:
- Gabriel Bellard (main contributor)
- Xavier Tannier (contact)
Thanks to the annotators:
- Dominique Ferrandini
- Samuel Laurent
- Daniel Oudet
- Denis Teyssou
Thanks to Christophe Boumenot, who helped us with the Windows Wapiti library.
license to be defined
Software created by Gabriel Bellard and Xavier Tannier at LIMSI.
Please do not reuse or fork without permission (until we have defined the distribution license).
# Source Extractor (in French only)
**SourceExtractor** is a [CRF](https://en.wikipedia.org/wiki/Conditional_random_field)-based tool for extracting primary and secondary sources from news articles in French.
It detects primary, secondary, and anonymous sources, and performs coreference resolution on them. It can produce Brat format for visualization or JSON format for machine-readable output.
## Requirements
* **Java 8+**
* On Windows, give Java at least 1 GB of memory: `java -Xmx1g`
* Extensively tested on **Linux**, tested on **Windows**, untested on **Mac** (but the necessary Wapiti native library is loaded).
If you need to recompile the Mac and Windows libraries, see https://github.com/kermitt2/Wapiti
## Installation
The archive contains the following files and directories:
- `sourceextractor.jar`
- `config.properties`: edit this file so that it points to the `lib` and `resources` directories on your computer (see below)
- `lib`: contains external libraries and saved models.
  Install this directory wherever you want and
  edit the `LIB_DIR` property in the configuration file
- `resources`: contains the language-dependent resources for the system.
  Install this directory wherever you want and
  edit the `RESOURCES_DIR` property in the configuration file
- `code`: the source code
- `LICENSE.txt`: the license file
- `README.md`: this file
- `RESULTS.txt`: the results obtained on an annotated test set by the different models
- `AUTHORS.txt`
## Example usage:
The command is similar on Windows and Linux/Mac, except that a different Wapiti jar must be loaded. Wapiti jars are located in directory `lib/wapiti`. [Wapiti](https://wapiti.limsi.fr/) is the tool used for training and running the CRF model.
**Linux/Mac**
`java -Xmx1g -cp lib/jar/*:lib/jar/wapiti/wapiti-1.5.0-lin.jar:source-extractor-0.1.jar fr.limsi.sourceExtractor.SourceExtractor -d <INPUT DIR> -o <OUTPUT DIR> -c <CONFIGURATION FILE> -j <THREAD NUMBER>`
**Windows**
`java -Xrs -Xmx1g -cp lib\jar\*;lib\jar\wapiti\wapiti-1.5.0-win.jar;source-extractor-0.1.jar fr.limsi.sourceExtractor.SourceExtractor -d <INPUT DIR> -o <OUTPUT DIR> -c <CONFIGURATION FILE> -j <THREAD NUMBER>`
### Input:
* The input can be a file (`-f`) or a directory (`-d`) containing documents (the two options exist for notational convenience only; either one accepts a file or a directory).
* The default input is plain-text documents. Use option `-newsml` for NewsML files (in fact, `-newsml` accepts any XML document whose content is in `<p>` elements).
### Output:
* `-o <DIR>` specifies the output directory
* The default output format is JSON. Use option `-b` or `--brat` for [Brat](http://brat.nlplab.org/) output
* See a description of the JSON output below
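
To script runs on both platforms, the two commands above can be assembled programmatically. The following Python sketch is only an illustration: `build_command` and its parameters are hypothetical names, and it assumes the jar names and relative paths shown in the commands above, run from the directory that contains `lib` and the SourceExtractor jar.

```python
import os
from typing import List, Optional


def build_command(
    input_path: str,
    output_dir: str,
    config_file: str,
    threads: int = 1,
    brat: bool = False,
    newsml: bool = False,
    windows: Optional[bool] = None,
) -> List[str]:
    """Return the argument list for one SourceExtractor run (hypothetical helper)."""
    if windows is None:
        windows = os.name == "nt"
    # Java separates classpath entries with ';' on Windows and ':' elsewhere.
    sep, slash = (";", "\\") if windows else (":", "/")
    # Jar names are copied from the example commands; adjust them to your archive.
    wapiti = "wapiti-1.5.0-win.jar" if windows else "wapiti-1.5.0-lin.jar"
    classpath = sep.join([
        slash.join(["lib", "jar", "*"]),
        slash.join(["lib", "jar", "wapiti", wapiti]),
        "source-extractor-0.1.jar",
    ])
    cmd = ["java"]
    if windows:
        cmd.append("-Xrs")  # present only in the Windows example command
    cmd += [
        "-Xmx1g", "-cp", classpath,
        "fr.limsi.sourceExtractor.SourceExtractor",
        "-d", input_path, "-o", output_dir,
        "-c", config_file, "-j", str(threads),
    ]
    if brat:
        cmd.append("-b")       # Brat output instead of the default JSON
    if newsml:
        cmd.append("-newsml")  # NewsML (or other <p>-based XML) input
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. No shell is needed for the `*` in the classpath, because the `java` launcher expands classpath wildcards itself.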
### Configuration file:
`config.properties` must be edited with the following information:
* `LIB_DIR` = <path to the directory 'lib' downloaded with the distribution>
* `RESOURCES_DIR` = <path to the directory 'resources' downloaded with the distribution>
* `DATA_DIR` is only necessary for training; there is no need to set it properly in production mode.
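
A quick sanity check of the configuration before launching a long run can save time. The sketch below is an assumption-laden illustration, not part of the tool: it assumes `config.properties` uses plain `KEY = value` lines, handles only that simple subset of the Java properties syntax, and the helper names `read_properties` and `missing_dirs` are hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def read_properties(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and '#'/'!' comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '=' at all
            props[key.strip()] = value.strip()
    return props


def missing_dirs(
    props: Dict[str, str],
    keys: Iterable[str] = ("LIB_DIR", "RESOURCES_DIR"),
) -> List[str]:
    """Return the required keys that are unset or do not name an existing directory."""
    return [k for k in keys if not props.get(k) or not Path(props[k]).is_dir()]
```

For example, `missing_dirs(read_properties(Path("config.properties").read_text()))` returns an empty list when both directories exist. `DATA_DIR` is deliberately left out of the default check because it is only needed for training.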
## Want a faster process?
* Multi-threading:
The default is single-threaded. Use `-j N` to run N threads.
* No secondary sources:
Use option `-p` to skip the extraction of secondary sources. Both the loading phase and the processing phase will be (much) faster: loading skips the huge list of media names, and processing skips the Wapiti extraction of secondary sources.
## License & Co
### License
See the file `LICENSE.txt`
### Third-party libraries and licenses
See the file `THIRD-PARTY.txt`
## Technical details
### JSON output format
Here is what the JSON output looks like:
```javascript
{"source_sentences":[
{"text":"Il a affirmé mardi devant les juges de la CPI n'être responsable d'\"aucune goutte de sang\" versée lors des violences ayant déchiré la Côte d'Ivoire en 2010-2011.",
"sources":[
{"start":635, // start offset of the source in the entire document
"end":637, // end offset of the source in the entire document
"type":"SOURCE-PRIM", // type (SOURCE-PRIM or SOURCE-SEC)
"text":"Il", // text
"value":"Charles Blé Goudé, le ministre de la Jeunesse de l'ancien président ivoirien Laurent Gbagbo", // normalized value (after coreference resolution)
"indexed_value":"Charles Blé Goudé" // normalized value for indexing (a shorter version of the normalized value, in which ambiguity between several names is removed when it exists). Indexing for later search should be done on this field.
}
...
]
}
...
]}
```
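
To illustrate how this output might be consumed, here is a minimal Python sketch that groups source mentions by `indexed_value`, the field recommended above for indexing. It assumes that the files actually written by the tool are plain JSON with the structure shown, without the explanatory comments and `...` placeholders of the example. The function name `sources_by_indexed_value` is hypothetical, and `sample` simply reuses the values from the example.

```python
from collections import defaultdict
from typing import Dict, List


def sources_by_indexed_value(document: dict) -> Dict[str, List[dict]]:
    """Group every extracted source mention under its `indexed_value`."""
    grouped = defaultdict(list)
    for sentence in document.get("source_sentences", []):
        for source in sentence.get("sources", []):
            grouped[source["indexed_value"]].append({
                "type": source["type"],                   # SOURCE-PRIM or SOURCE-SEC
                "mention": source["text"],                # surface form in the article
                "span": (source["start"], source["end"]), # document-level offsets
                "sentence": sentence["text"],
            })
    return dict(grouped)


# Same values as the example above, written as a Python dict.
sample = {
    "source_sentences": [{
        "text": "Il a affirmé mardi devant les juges de la CPI n'être responsable "
                "d'\"aucune goutte de sang\" versée lors des violences ayant déchiré "
                "la Côte d'Ivoire en 2010-2011.",
        "sources": [{
            "start": 635,
            "end": 637,
            "type": "SOURCE-PRIM",
            "text": "Il",
            "value": "Charles Blé Goudé, le ministre de la Jeunesse de l'ancien "
                     "président ivoirien Laurent Gbagbo",
            "indexed_value": "Charles Blé Goudé",
        }],
    }]
}

grouped = sources_by_indexed_value(sample)
```

Here `grouped` has a single key, `"Charles Blé Goudé"`, whose only entry records the mention `"Il"` with span `(635, 637)`. In the example, `end - start` equals the length of `text`, which suggests that `end` is exclusive. For real output, each JSON file in the output directory can be loaded with `json.load` and passed to the same function.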
### Training set
Want to compare or reproduce our results? Ask us for our training set, which can be shared under certain conditions.
### Adapting to other languages
You'll need:
* To annotate a training set (about 300 documents -- 2000 sources -- seems to be a good number)
* To build a few resources (citation verbs, profession list, etc.)
* To have a dependency parser in your language, and ideally a lemmatizer
Contact us if you are interested!
AFP documents only (total 255 documents, 75% train, 10% dev, 15% test)
- BIO, with media list and secondary sources
Summary
Type         TP   FP   FN   Precision  Recall  F1
SOURCE-PRIM  167  15   68   0,9176     0,7106  0,8010
SOURCE-SEC   16   2    17   0,8889     0,4848  0,6275
Overall      183  17   85   0,9150     0,6828  0,7821
- BIO, without media list and secondary sources
Summary
Type         TP   FP   FN   Precision  Recall  F1
SOURCE-PRIM  167  14   67   0,9227     0,7137  0,8048
SOURCE-SEC   0    0    33   0,0000     0,0000  0,0000
Overall      167  14   100  0,9227     0,6255  0,7455
- IO, with media list and secondary sources
Summary
Type         TP   FP   FN   Precision  Recall  F1
SOURCE-PRIM  174  16   60   0,9158     0,7436  0,8208
SOURCE-SEC   13   0    20   1,0000     0,3939  0,5652
Overall      187  16   80   0,9212     0,7004  0,7957
- IO, without media list and secondary sources
Summary
Type         TP   FP   FN   Precision  Recall  F1
SOURCE-PRIM  172  16   62   0,9149     0,7350  0,8152
SOURCE-SEC   0    0    33   0,0000     0,0000  0,0000
Overall      172  16   95   0,9149     0,6442  0,7560
AFP+Web documents (total 363 documents, 75% train, 10% dev, 15% test)
- BIO, with media list and secondary sources
Summary
Type         TP   FP   FN   Precision  Recall  F1
SOURCE-PRIM  256  28   68   0,9014     0,7901  0,8421
SOURCE-SEC   28   2    22   0,9333     0,5600  0,7000
Overall      284  30   90   0,9045     0,7594  0,8256
- BIO, without media list and secondary sources
Summary
Type         TP   FP   FN   Precision  Recall  F1
SOURCE-PRIM  256  28   68   0,9014     0,7901  0,8421
SOURCE-SEC   0    0    50   0,0000     0,0000  0,0000
Overall      256  28   118  0,9014     0,6845  0,7781
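
For readers who want to recompute or compare against the figures above, the Precision, Recall and F1 columns follow from the TP, FP and FN counts by the standard definitions. The Python sketch below (the function name `prf` is hypothetical) reproduces them. The tables use a decimal comma, and the "Overall" rows are consistent with summing the counts of the two source types before computing the scores.

```python
from typing import Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts, with 0.0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# SOURCE-PRIM row of the first AFP table (BIO, with media list and secondary sources)
print([round(x, 4) for x in prf(167, 15, 68)])  # [0.9176, 0.7106, 0.801]
# Overall row of the first AFP+Web table
print([round(x, 4) for x in prf(284, 30, 90)])  # [0.9045, 0.7594, 0.8256]
```

Rows with no true or false positives, such as SOURCE-SEC when secondary sources are disabled, come out as all zeros, matching the tables.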
* commons-cli (Apache License, Version 2.0)
* commons-collections (Apache License, Version 2.0)
* commons-io (Apache License, Version 2.0)
* commons-lang (Apache License, Version 2.0)
* grobid (Apache License, Version 2.0)
* guava (Apache License, Version 2.0)
* hfst-ol (Apache License, Version 2.0)
* liblinear-java (https://github.com/bwaldvogel/liblinear-java/blob/master/COPYRIGHT)
* log4j (Apache License, Version 2.0)
* slf4j (MIT License)
* maltparser (http://www.maltparser.org/license.html)
* maltparser French model has been provided by Candito et al. This model can be used for research purposes provided that you have a (free) license for the French Treebank. If you want to use it for commercial applications, please contact the license holder for the treebank to find out which conditions apply.
* stanford-corenlp (GNU General Public License (v3))
* wapiti-X.jar and the native libraries were compiled by ourselves,
with the help of code from Grobid and Christophe Boumenot
* lemmatization is an adapted version of Ahmet Aker's code, coming with no license information (http://staffwww.dcs.shef.ac.uk/people/A.Aker/activityNLPProjects.html)