# Source Extractor (in French only) **SourceExtractor** is a [CRF](https://en.wikipedia.org/wiki/Conditional_random_field)-based tool for extracting primary and secondary sources from news articles in French. It detects primary sources, secondary sources, performs coreference resolution on sources, and detects anonymous sources. It can produce Brat format for visualization or JSON format for a machine-readable output. ## Requirements * **java 8+** * On Windows, give java at least 1 Gb memory: *java -Xmx1g* * Extensively tested on **Linux**, tested on **Windows**, untested on **Mac** (but the necessary Wapiti native library is loaded). If you need to recompile Max and Windows libraries, see https://github.com/kermitt2/Wapiti ## Installation The archive contains the following files and directories: - `sourceextractor.jar` - `config.properties`: edit this file with links to the `lib` and `resources` directory on your computer (see below) - `lib`: contains external librairies and saved models. Install this directory wherever you want and edit the `LIB_DIR` property in the configuration file - `resources`: contains the language-dependent resources for the system Install this directory wherever you want and edit the `RESOURCES_DIR` property in the configuration file - `code`: the source code - `LICENSE.txt`: the license file - `README.md`: this file - `RESULTS.txt`: the results obtained on an annotated test set by the different models - `AUTHORS.txt` ## Example usage: The command is similar on Windows or Linux/MAC, except that a different Wapiti jar must be loaded. Wapiti jars are located in directory `lib/wapiti`. [Wapiti](https://wapiti.limsi.fr/) is the tool used for learning and running the CRF model. **Linux/MAC** `java -Xmx1g -cp lib/jar/*:lib/jar/wapiti/wapiti-1.5.0-lin.jar:source-extractor-0.1.jar fr.limsi.sourceExtractor.SourceExtractor -d -o -c -j ` **Windows** `java -Xrs -Xmx1g -cp lib\jar\*;lib\jar\wapiti\wapiti-1.5.0-win.jar;source-extractor-0.1.jar fr.limsi.sourceExtractor.SourceExtractor -d -o -c -j ` ### Input: * can be a file (`-f`) or a directory (`-d`) containing documents (for convience notation only, we can actually feed a file or a directory with both -f and -d). * Default input is textual documents. Use option `-newsml` for NewsML files. (Actually the option -newsml accepts any XML document which content is in

elements) ### Output: * `-o

` specifies the output directory * default output format is JSON. Use option `-b` or `--brat` for a [Brat](http://brat.nlplab.org/) output * See a description of the JSON output below ### Configuration file: `config.properties` must be edited with the following information: * `LIB_DIR` =