author | Samuel Fadel <samuelfadel@gmail.com> | 2016-08-19 14:20:57 -0300 |
---|---|---|
committer | Samuel Fadel <samuelfadel@gmail.com> | 2016-08-19 14:20:57 -0300 |
commit | b255338295587246292dc978e7d4d5687ee01fb4 (patch) | |
tree | 1581b76a03f4929c5132dcb3c6920fa761f8261c /datasets/newsgroups/stop.sh | |
parent | fbf8d82cdd3720c4bbf2a94035b6779e56d73448 (diff) |
Scripts and other files for building all datasets.
Diffstat (limited to 'datasets/newsgroups/stop.sh')
-rw-r--r-- | datasets/newsgroups/stop.sh | 12 |
1 files changed, 12 insertions, 0 deletions
diff --git a/datasets/newsgroups/stop.sh b/datasets/newsgroups/stop.sh
new file mode 100644
index 0000000..36a5f74
--- /dev/null
+++ b/datasets/newsgroups/stop.sh
@@ -0,0 +1,12 @@
+# stop.sh
+#
+# Generate proper stop words list from the 'stop.txt' file.
+
+
+# Original source: http://snowball.tartarus.org/algorithms/english/stop.txt
+# NOTE: in our experiments, stop.txt has been modified to include the last stop
+# words (stop.txt is included).
+
+sed 's/|.*//g' <stop.txt \
+    | sed 's/ \+//g' \
+    | sed '/^$/d' >words.txt
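
For readers who want to see what the three sed stages do, here is a minimal sketch. It is not part of the commit. It collapses the pipeline into a single sed call and runs it on a few made-up lines laid out like the Snowball stop.txt file, where anything after a `|` is a comment. The function name `clean_stop_words` and the sample input are assumptions for illustration. The sketch uses `s/ //g` where the script uses `s/ \+//g`; both remove every space, but `\+` is a GNU extension to basic regular expressions.

```shell
# clean_stop_words: hypothetical helper that mirrors the stop.sh pipeline.
# Reads stop.txt-style text on stdin and writes one stop word per line.
clean_stop_words() {
    # 1. drop everything from the first '|' to the end of the line (comments)
    # 2. remove all spaces (padding before the comment column)
    # 3. delete lines that are now empty
    sed -e 's/|.*//' -e 's/ //g' -e '/^$/d'
}

# Made-up sample input in the stop.txt layout (not the real file contents).
printf '%s\n' \
    ' | An English stop word list.' \
    '' \
    'i              | subject, always in upper case' \
    'me' \
    'my' \
    | clean_stop_words
```

On this sample the comment-only line and the blank line disappear, leaving `i`, `me`, and `my`, one per line. Running the same function as `clean_stop_words <stop.txt >words.txt` should behave like the committed script, assuming stop.txt contains no tabs, which neither version strips.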