# Source Extractor (in French only)

**SourceExtractor** is a [CRF](https://en.wikipedia.org/wiki/Conditional_random_field)-based tool for extracting primary and secondary sources from news articles in French. 
It detects primary and secondary sources, performs coreference resolution on sources, and detects anonymous sources. It can produce Brat format for visualization or JSON format for machine-readable output.

## Requirements

   * **Java 8+**
   * On Windows, give Java at least 1 GB of memory: *java -Xmx1g*
   * Extensively tested on **Linux**, tested on **Windows**, untested on **Mac** (but the necessary Wapiti native library is included).

If you need to recompile the Mac and Windows native libraries, see https://github.com/kermitt2/Wapiti

## Installation 

The archive contains the following files and directories:
 - `sourceextractor.jar`
 - `config.properties`: edit this file with the paths to the `lib` and `resources` directories on your computer (see below)
 - `lib`: contains the external libraries and the saved models.
   Install this directory wherever you want and edit the `LIB_DIR` property in the configuration file.
 - `resources`: contains the language-dependent resources for the system.
   Install this directory wherever you want and edit the `RESOURCES_DIR` property in the configuration file.
 - `code`: the source code
 - `LICENSE.txt`: the license file
 - `README.md`: this file
 - `RESULTS.txt`: the results obtained on an annotated test set by the different models
 - `AUTHORS.txt`
 

## Example usage

The command is similar on Windows and Linux/Mac, except that a different Wapiti jar must be loaded. The Wapiti jars are located in the `lib/jar/wapiti` directory. [Wapiti](https://wapiti.limsi.fr/) is the tool used for training and running the CRF model.

**Linux/Mac**

`java -Xmx1g -cp lib/jar/*:lib/jar/wapiti/wapiti-1.5.0-lin.jar:source-extractor-0.1.jar fr.limsi.sourceExtractor.SourceExtractor -d <INPUT DIR> -o <OUTPUT DIR> -c <CONFIGURATION FILE> -j <THREAD NUMBER>`

**Windows**

`java -Xrs -Xmx1g -cp lib\jar\*;lib\jar\wapiti\wapiti-1.5.0-win.jar;source-extractor-0.1.jar fr.limsi.sourceExtractor.SourceExtractor -d <INPUT DIR> -o <OUTPUT DIR> -c <CONFIGURATION FILE> -j <THREAD NUMBER>`


### Input:
   * The input can be a file (`-f`) or a directory (`-d`) containing documents (the distinction is only for notational convenience: both `-f` and `-d` actually accept either a file or a directory).
   * The default input is plain-text documents. Use the option `-newsml` for NewsML files (in fact, `-newsml` accepts any XML document whose textual content is in `<p>` elements; see the example below).
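
A minimal sketch of the kind of XML accepted with `-newsml` (the wrapper element names are illustrative; only the `<p>` elements matter):

```xml
<!-- illustrative example: any XML whose textual content sits in <p> elements -->
<newsitem>
  <p>Premier paragraphe de la dépêche.</p>
  <p>« Nous n'avons rien à cacher », a déclaré le porte-parole.</p>
</newsitem>
```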

### Output:
   * `-o <DIR>` specifies the output directory
   * The default output format is JSON. Use the option `-b` or `--brat` for a [Brat](http://brat.nlplab.org/) output.
   * See the description of the JSON output format below.

### Configuration file: 
   `config.properties` must be edited with the following information (see the example after the list):
   * `LIB_DIR`: the path to the `lib` directory downloaded with the distribution
   * `RESOURCES_DIR`: the path to the `resources` directory downloaded with the distribution
   * `DATA_DIR` is only necessary for training purposes; there is no need to set it properly in production mode.
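
A minimal `config.properties` sketch, assuming the standard Java `key=value` properties syntax (the paths are illustrative; replace them with the actual locations on your machine):

```properties
# Illustrative paths; point these at the lib and resources directories from the archive
LIB_DIR=/home/user/source-extractor/lib
RESOURCES_DIR=/home/user/source-extractor/resources
# DATA_DIR is only needed for training and can be ignored in production
```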

## Want a faster process?
   * Multi-threading:
   The default is single-threaded. Use `-j N` to use N threads.
   * No secondary sources:
   Use the option `-p` to skip the extraction of secondary sources. Both the loading phase and the processing phase will be (much) faster: loading skips the huge list of media names, and processing skips the Wapiti extraction of secondary sources. See the example command below.
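
For instance, a Linux invocation combining both options might look like this (the input, output, and thread values are illustrative):

`java -Xmx1g -cp lib/jar/*:lib/jar/wapiti/wapiti-1.5.0-lin.jar:source-extractor-0.1.jar fr.limsi.sourceExtractor.SourceExtractor -d corpus/ -o out/ -c config.properties -j 4 -p`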

## License & Co

### License

See the file `LICENSE.txt`

### Third-party libraries and licenses

See the file `THIRD-PARTY.txt`

## Technical details

### JSON output format

Here is what a JSON output looks like (the `//` comments are explanatory annotations, not part of the actual output):
```javascript
{"source_sentences":[
  {"text":"Il a affirmé mardi devant les juges de la CPI n'être responsable d'\"aucune goutte de sang\" versée lors des violences ayant déchiré la Côte d'Ivoire en 2010-2011.",
   "sources":[
      {"start":635,  // start offset of the source in the entire document
       "end":637,  // end offset of the source in the entire document
       "type":"SOURCE-PRIM",  // type (SOURCE-PRIM or SOURCE-SEC)
       "text":"Il",  // text
       "value":"Charles Blé Goudé, le ministre de la Jeunesse de l'ancien président ivoirien Laurent Gbagbo",  // normalized value (after coreference resolution)
       "indexed_value":"Charles Blé Goudé"  // normalized value for indexing (a shorter version of the normalized value, where ambiguity on several names inis removed when it exists). Indexing for further research should be done on this field. 
      }
      ...
     ]
  }
  ...
]}
```
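
As a consumer-side illustration, here is a minimal Java sketch that reads such an output file and prints the extracted sources. It assumes the [Jackson](https://github.com/FasterXML/jackson-databind) library is on the classpath and a file named `article.json`; neither is part of the distribution.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;
import java.io.IOException;

public class ReadSources {
    public static void main(String[] args) throws IOException {
        // Parse the JSON produced by SourceExtractor (the file name is illustrative)
        JsonNode root = new ObjectMapper().readTree(new File("article.json"));

        // Iterate over the sentences containing sources, then over the sources themselves
        for (JsonNode sentence : root.get("source_sentences")) {
            for (JsonNode source : sentence.get("sources")) {
                System.out.printf("%s [%d-%d] %s -> %s%n",
                        source.get("type").asText(),           // SOURCE-PRIM or SOURCE-SEC
                        source.get("start").asInt(),           // offsets in the document
                        source.get("end").asInt(),
                        source.get("text").asText(),           // surface form of the source
                        source.get("indexed_value").asText()); // value to use for indexing
            }
        }
    }
}
```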

### Training set

Want to compare with or reproduce our results? Ask us for our training set, which can be shared under certain conditions.

### Adapting to other languages

You'll need:
   * To annotate a training set (about 300 documents -- 2000 sources -- seems to be a good number)
   * To build a few resources (citation verbs, profession list, etc.)
   * To have a dependency parser for your language, and ideally a lemmatizer

Contact us if you are interested!