From b255338295587246292dc978e7d4d5687ee01fb4 Mon Sep 17 00:00:00 2001
From: Samuel Fadel
Date: Fri, 19 Aug 2016 14:20:57 -0300
Subject: Scripts and other files for building all datasets.

---
 datasets/newsgroups/stop.sh | 12 ++++++++++++
 1 file changed, 12 insertions(+)
 create mode 100644 datasets/newsgroups/stop.sh

(limited to 'datasets/newsgroups/stop.sh')

diff --git a/datasets/newsgroups/stop.sh b/datasets/newsgroups/stop.sh
new file mode 100644
index 0000000..36a5f74
--- /dev/null
+++ b/datasets/newsgroups/stop.sh
@@ -0,0 +1,12 @@
+# stop.sh
+#
+# Generate proper stop words list from the 'stop.txt' file.
+
+
+# Original source: http://snowball.tartarus.org/algorithms/english/stop.txt
+# NOTE: in our experiments, stop.txt has been modified to include the last stop
+# words (stop.txt is included).
+
+sed 's/|.*//g' words.txt
--
cgit v1.2.3
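
For reference, a minimal sketch of the comment-stripping step the script describes. It assumes the intent is to read the Snowball-format stop.txt (where `|` starts a comment) and write a plain word list to words.txt. The function name make_stop_words is hypothetical and not part of the commit, and the whitespace trimming and blank-line removal go beyond the single sed expression shown in the patch.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original commit): turn a
# Snowball-style stop word file into a plain list of words.
make_stop_words() {
    # $1: path to the input file (assumed to be stop.txt)
    sed -e 's/|.*//' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d' "$1"
    # 1st expression: drop everything from the first '|' to end of line
    # 2nd and 3rd:    trim leading and trailing whitespace
    # 4th:            delete lines that are now empty
}

# Assumed usage, run from datasets/newsgroups:
#   make_stop_words stop.txt > words.txt
```

The `g` flag in the original expression is harmless but not needed here, since `|.*` already consumes the rest of the line on its first match.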