Asao Kojiro

Sentence Segmenter

**Name
sentseg.pl - Sentence Segmenter
**Synopsis: Usage
$ ./sentseg.pl < InputFile > OutputFile
**Description
This Perl script reads a text file from standard input and splits it so that each sentence is on a separate line. The script does not, however, guarantee 100 percent accuracy, for the reasons described in the Notes below.
**Notes
Even though the script works well for most purposes, 100 percent accuracy is not guaranteed. The script determines sentence boundaries on the basis of orthographic features alone and does not take context into consideration. For this reason, it is essential to check the output file manually after running the script to see whether any irregularities have occurred.

Most errors involve abbreviations that end with a full stop. The script handles common abbreviations such as Mr., Ms., Dr., and D.C. correctly. It is, however, unrealistic to cover every possibility. If you are going to repeat the work on a particular genre of text, you can improve accuracy by modifying the list of abbreviations in the script. To adapt the list to your purpose, enter new abbreviations in lines 20 and 22 of the script.
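The idea of combining orthographic boundary detection with an abbreviation list can be sketched as follows. This is a minimal illustration in Python, not the Perl code of sentseg.pl: the names `ABBREVIATIONS`, `BOUNDARY`, and `segment` are hypothetical, and the boundary rule (a full stop, question mark, or exclamation mark followed by whitespace and then an upper-case letter or a double quote) is an assumption made for the example.

```python
import re

# Hypothetical abbreviation list for illustration; the real script keeps
# its own list in its source (see lines 20 and 22 of sentseg.pl).
ABBREVIATIONS = {"Mr.", "Ms.", "Dr.", "D.C."}

# Assumed orthographic rule: whitespace that follows ., ? or ! and
# precedes an upper-case letter or a double quote is a candidate boundary.
BOUNDARY = re.compile(r'(?<=[.?!])\s+(?=[A-Z"])')

def segment(text, abbreviations=ABBREVIATIONS):
    """Split text into sentences, skipping boundaries after known abbreviations."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        # The token immediately before the candidate boundary.
        last_token = text[start:match.start()].split()[-1]
        if last_token in abbreviations:
            continue  # e.g. "Dr." does not end the sentence
        sentences.append(text[start:match.start()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

if __name__ == "__main__":
    sample = "The Acme Corp. Building is tall. It opened in 1990."
    # "Corp." is not in the default list, so a false boundary appears.
    print(segment(sample))
    # Adding the genre-specific abbreviation removes the false boundary.
    print(segment(sample, ABBREVIATIONS | {"Corp."}))
```

In this sketch, extending the set passed as `abbreviations` plays the same role as adding entries to the list inside the script: an abbreviation that is frequent in your genre of text stops being mistaken for the end of a sentence.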
**Known Irregularities
When a word or an abbreviation consisting of a single upper-case letter, such as "J", is followed by a full stop (as in "J. Thurston"), the script interprets it as part of the sentence. For this reason, sentences like the following are not separated from the text that comes after them. These sentences have to be separated manually.
So do I.
The answer is C.
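The effect of this rule can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the Perl code of sentseg.pl: the names `BOUNDARY`, `SINGLE_CAPITAL`, and `segment_with_initials` are hypothetical, and the boundary rule is the same simplified one assumed for illustration (terminal punctuation, whitespace, then an upper-case letter or a double quote).

```python
import re

# Assumed orthographic rule for a candidate sentence boundary.
BOUNDARY = re.compile(r'(?<=[.?!])\s+(?=[A-Z"])')

# A single upper-case letter followed by a full stop, e.g. "J." in "J. Thurston".
SINGLE_CAPITAL = re.compile(r'^[A-Z]\.$')

def segment_with_initials(text):
    """Split text into sentences, never splitting after a lone capital letter."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        last_token = text[start:match.start()].split()[-1]
        if SINGLE_CAPITAL.match(last_token):
            continue  # treated as an initial, so no boundary is placed here
        sentences.append(text[start:match.start()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

if __name__ == "__main__":
    sample = "J. Thurston wrote it. So do I. The answer is C. That is all."
    for sentence in segment_with_initials(sample):
        print(sentence)
```

With this sample, "J. Thurston" is correctly kept together, but "So do I." and "The answer is C." stay joined to the sentences that follow them, which is the kind of output that needs manual correction.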
**Download
Two files are available: one is the script with English annotations and the other is the script with Japanese annotations. Apart from the language of the annotations, the two scripts are identical. After downloading the script, change the file name to "sentseg.pl" or any other name you prefer.

sentseg.pl.en.txt (with English annotations)
sentseg.pl.jp.txt (with Japanese annotations)

**Acknowledgment
Special thanks to Shinichi Shimizu, who originally wrote the portion of the script that handles abbreviations. The current script, sentseg.pl, uses his idea with his permission. Without it, the script would not have been realized.
**Copyright
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.

Updated 26 March, 2006
Copyright (C) 2006 Asao Kojiro
