Tokenize the stop list

That was the phrase I googled, yahooed, and binged to start this assignment.

At first I looked around for information on stop lists and found some interesting links describing methods for going through text. I also downloaded a PDF on "Information Retrieval" which was very good but went a little further than where we stand right now in class.

Still, some of the algorithms went something like this:
1. Remove punctuation
2. Detect words
3. Remove the, to, … (stop list)
4. Use the stemmer
5. Use a "lemmatizer" to remove endings like -ies, -ed, …

A more general algorithm shown in the Info Retrieval PDF was:
tokenize -> stemmer -> stop list -> index
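That general pipeline can be sketched in plain Java. This is just my toy version of it: the suffix-stripping "stemmer", the tiny stop set, and all the names here are stand-ins I made up, not anything from the PDF.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PipelineSketch {

    static final Set<String> STOP = Set.of("the", "is", "a", "to");

    // Toy stemmer: just strips a couple of common endings.
    // A real system would use something like a Porter stemmer.
    static String stem(String token) {
        if (token.endsWith("ing")) return token.substring(0, token.length() - 3);
        if (token.endsWith("ed"))  return token.substring(0, token.length() - 2);
        return token;
    }

    // tokenize -> stemmer -> stop list -> index (term -> token positions)
    static Map<String, List<Integer>> index(String text) {
        Map<String, List<Integer>> index = new HashMap<>();
        String[] tokens = text.toLowerCase().split("\\W+"); // tokenize
        for (int i = 0; i < tokens.length; i++) {
            String term = stem(tokens[i]);                  // stem
            if (term.isEmpty() || STOP.contains(term)) continue; // stop list
            index.computeIfAbsent(term, k -> new ArrayList<>()).add(i); // index
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(index("the dog is chasing a ball"));
    }
}
```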

I looked for word lists for the stop list and was able to find some good compilations and suggestions. In some cases I had to prepare them in Excel, get them together in Notepad, and do some find-and-replace actions to get the desired "word1","word2","word3"… format that Heather's example used.

One thing I had to do was remove as many repeated words as I could (though this would be taken care of by the Java program anyway…). BUT MOST IMPORTANT, these word lists really have to be thought through, with each word picked depending on the final use, or intention, of the information being processed.
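The Excel/Notepad cleanup could also be scripted. Here is a little helper I sketched (my own names, not part of Heather's example) that dedupes a raw word list and prints it in that quoted format:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Collectors;

public class StopListFormatter {

    // Dedupe a raw whitespace-separated word list and render it in the
    // "word1","word2","word3" format the Java stop-list code expects.
    static String format(String rawList) {
        Set<String> words = new LinkedHashSet<>(
                Arrays.asList(rawList.trim().toLowerCase().split("\\s+")));
        return words.stream()
                .map(w -> "\"" + w + "\"")
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // duplicates collapse: "the","to","of"
        System.out.println(format("the to The of the"));
    }
}
```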

Google's algorithm has gone from 16 filters in 2006 to more than 120 filters in 2010. My list, compared with Google's list, feels like using a chainsaw to cut your toenails, because it does not take into account any other type of context that might help filter the list in order to get a better result.

from searches like:
"google, yahoo, bing algorithms"

"google stop words"

Actually, the Google stop word list I found is pretty short. And from what I read about the other filters, it changes according to them.

The comparison with Yahoo and Bing… I was really surprised to find out that Yahoo uses Bing to serve its results! I did think, hey, they were very similar, but then I read an article that stated Yahoo would be making Bing its search engine, checked the results, and they were the same… this was really surprising. Also, their 2006 algorithm had half the filters that Google has.

And Google just keeps adding to and modifying its algorithm! Not that this makes it better… you can still take advantage of it by having "bad manners," as in the JC Penney case last Christmas, where JC Penney was the top search result for many, many, many searches. This was done by turning part of the algorithm's filters "against itself"… but in the end the "damage" was done, and Google is taking steps to "ban" JC Penney, though this does not guarantee it cannot be done again.

Another article stated that Bing was trying to "compete" with Google not only with its search engine alone, but by incorporating more services into some special types of searches, like air tickets. So specialized search is competitive, but quick-reference generalized search that brings good contextual results is still in Google's realm… and it doesn't look like that is going anywhere.

Google's capability of "knowing" what you want seems like magic… so I think a good stop list is good to have around, but for more successful results in WWW searches, it is naive to think that a stop list is really going to help without other types of filters that keep the list flexible.

A simple but curious note: Google is still the only one of these search engines that has become a verb in common talk.
AND it is incredible how much of the web is not visible to any of these efforts; it's like knowing the world only through your phone book. Scary.

A first attempt at code – I want to work on the tokenizer part.
I am using Java's StringTokenizer to remove punctuation (or would it be better to use regular expressions?):
package com.lingpipe.book.tok;

import java.util.Set;
import java.util.StringTokenizer;

import com.aliasi.tokenizer.EnglishStopTokenizerFactory;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.LowerCaseTokenizerFactory;
import com.aliasi.tokenizer.StopTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.CollectionUtils;

public class nelsonTokenizerStopList {
/* NELSON:
 * Before the Indo-European tokenizer, I am using StringTokenizer to remove some
 * punctuation; then the algorithm goes as Heather's.
 * HEATHER:
 * The Indo-European tokenizer will tokenize, the resulting tokens will be
 * converted to lower case, and then stop words will be removed.
 */

    public static void main(String[] args) {
        String text = "This list is a test to see if the stop list is chopping words that do not help the search engine";
        // First, get rid of the punctuation. Instead of repeating the same
        // StringTokenizer block once per character, loop over the characters
        // to take out: : [ ] ' ; , ! @ $ & * ( ) - _ + =
        String[] delims = {":", "[", "]", "'", ";", ",", "!", "@",
                           "$", "&", "*", "(", ")", "-", "_", "+", "="};
        for (String delim : delims) {
            StringTokenizer st = new StringTokenizer(text, delim);
            StringBuilder aux = new StringBuilder();
            while (st.hasMoreTokens()) {
                aux.append(st.nextToken());
                aux.append(" "); // so that it does not lose a space
            }
            text = aux.toString();
        }
        // Then tokenize on white space; from these words, take away the ones on the stop list.

Set<String> stopSet2 = CollectionUtils.asSet("able","about","above","abroad","according","accordingly","across","actually","adj","after","afterwards","again",
"against","abroad","ahead","ain't","all","allow","allows","almost","alone","along","alongside","already",
"also","although","always","am","amid","amidst","among","amongst","alongside","and","another","any",
"anybody","anyhow","anyone","anything","anyway","anyways","anywhere","apart","appear","appreciate",
"appropriate","are","aren't","around","as","anyways","aside","ask","asking","associated","at","available",
"away","awfully","back","backward","backwards","be","became","because","become","becomes","becoming",
"been","back","beforehand","begin","behind","being","believe","below","beside","besides","best","better",
"between","beyond","both","brief","but","by","came","can","best","cant","can't","caption","cause",
"causes","certain","certainly","changes","clearly","c'mon","co","co.","com","come","comes","concerning",
"consequently","consider","clearly","contain","containing","contains","corresponding","could","couldn't",
"course","c's","currently","dare","daren't","definitely","described","despite","did","didn't","different",
"directly","currently","does","doesn't","doing","done","don't","down","downwards","during","each","edu",
"eg","eight","eighty","either","else","elsewhere","end","ending","each","entirely","especially","et",
"etc","even","ever","evermore","every","everybody","everyone","everything","everywhere","ex","exactly",
"example","except","fairly","far","everybody","few","fewer","fifth","first","five","followed","following",
"follows","for","forever","former","formerly","forth","forward","found","four","from","further","for",
"get","gets","getting","given","gives","go","goes","going","gone","got","gotten","greetings","had",
"hadn't","half","happens","hardly","has","gone","have","haven't","having","he","he'd","he'll","hello",
"help","hence","her","here","hereafter","hereby","herein","here's","hereupon","hers","herself","hence",
"hi","him","himself","his","hither","hopefully","how","howbeit","however","hundred","i'd","ie","if",
"ignored","i'll","i'm","immediate","in","however","inc","inc.","indeed","indicate","indicated",
"indicates","inner","inside","insofar","instead","into","inward","is","isn't","it","it'd","it'll",
"its","insofar","itself","i've","just","k","keep","keep","keeps","kept","know","known","knows","last",
"lately","later","latter","latterly","least","less","know","let","let's","like","liked","likely",
"likewise","little","look","looking","looks","low","lower","ltd","made","mainly","make","makes","many",
"looking","maybe","mayn't","me","mean","meantime","meanwhile","merely","might","mightn't","mine","minus",
"miss","more","moreover","most","mostly","mr","mrs","mightn't","must","mustn't","my","myself","name",
"namely","nd","near","nearly","necessary","need","needn't","needs","neither","never","neverf","neverless",
"nevertheless","nearly","next","nine","ninety","no","nobody","non","none","nonetheless","noone",
"no-one","nor","normally","not","nothing","notwithstanding","novel","now","nowhere","noone","of","off",
"often","oh","ok","okay","old","on","once","one","ones","one's","only","onto","opposite","or","other",
"others","once","ought","oughtn't","our","ours","ourselves","out","outside","over","overall","own",
"particular","particularly","past","per","perhaps","placed","please","plus","overall","presumably",
"probably","provided","provides","que","quite","qv","rather","rd","re","really","reasonably","recent",
"recently","regarding","regardless","regards","relatively","rd","right","round","said","same","saw",
"say","saying","says","second","secondly","see","seeing","seem","seemed","seeming","seems","seen","self",
"second","sensible","sent","serious","seriously","seven","several","shall","shan't","she","she'd",
"she'll","she's","should","shouldn't","since","six","so","some","she","someday","somehow","someone",
"something","sometime","sometimes","somewhat","somewhere","soon","sorry","specified","specify",
"specifying","still","sub","such","sup","sure","soon","taken","taking","tell","tends","th","than","thank",
"thanks","thanx","that","that'll","thats","that's","that've","the","their","theirs","them","thanx","then",
"thence","there","thereafter","thereby","there'd","therefore","therein","there'll","there're","theres",
"there's","thereupon","there've","these","they","they'd","they'll","there'll","they've","thing","things",
"think","third","thirty","this","thorough","thoroughly","those","though","three","through","throughout",
"thru","thus","till","to","thoroughly","too","took","toward","towards","tried","tries","truly","try",
"trying","t's","twice","two","un","under","underneath","undoing","unfortunately","unless","trying",
"unlikely","until","unto","up","upon","upwards","us","use","used","useful","uses","using","usually",
"v","value","various","versus","very","used","viz","vs","want","wants","was","wasn't","way","we","we'd",
"welcome","well","we'll","went","were","we're","weren't","we've","what","we'd","what'll","what's",
"what've","when","whence","whenever","where","whereafter","whereas","whereby","wherein","where's",
"whereupon","wherever","whether","which","whichever","while","whereas","whither","who","who'd",
"whoever","whole","who'll","whom","whomever","who's","whose","why","will","willing","wish","with",
"within","without","wonder","who's","would","wouldn't","yes","yet","you","you'd","you'll","your",
"you're","yours","yourself","yourselves","you've");
Set<String> stopSet = CollectionUtils.asSet("a","able","about","across","after","all","almost","also","am",
"among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear",
"did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her",
"hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like",
"likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only",
"or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that",
"the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we",
"were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your",
"'tis","'twas","a","able","about","across","after","ain't","all","almost","also","am","among","an","and",
"any","are","aren't","as","at","be","because","been","but","by","can","can't","cannot","could","could've",
"couldn't","dear","did","didn't","do","does","doesn't","don't","either","else","ever","every","for",
"from","get","got","had","hasn't","have","he","he'd","he'll","he's","her","hers","him","his","how",
"how'd","how'll","how's","however","i","i'd","i'll","i'm","i've","if","in","into","is","isn't","it","it's",
"its","just","least","let","like","likely","may","me","might","might've","mightn't","most","must've",
"mustn't","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own",
"rather","said","say","says","shan't","she","she'd","she'll","she's","should","should've","shouldn't",
"since","so","some","than","that","that'll","that's","the","their","them","then","there","there's",
"these","they","they'd","they'll","they're","they've","this","tis","to","too","twas","us","wants","was",
"wasn't","we","we'd","we'll","we're","were","weren't","what","what'd","what's","when'd","when'll",
"when's","where","where'd","where'll","where's","which","while","who","who'd","who'll","who's","whom",
"why","why'd","why'll","why's","will","with","won't","would","would've","wouldn't","yet","you","you'd",
"you'll","you're","you've","'tis","'twas","ain't","aren't","can't","could've","couldn't","didn't",
"doesn't","don't","he's","how'd","how'll","how's","i'd","i'll","i'm","i've","isn't","it's","might've",
"mightn't","shan't","she'd","she'll","she's","should've","shouldn't","that'll","that's","there's",
"they'd","they'll","they're","they've","wasn't","we'd","we'll","we're","weren't","what'd","what's",
"where'd","where'll","where's","who'd","who'll","who's","why'd","why'll","why's","won't","would've",
"wouldn't","you'd","you'll","you're","you've","I","com","de","en","la","und","www");

stopSet.addAll(stopSet2); // merge the two lists, if that is wanted!

TokenizerFactory f1 = IndoEuropeanTokenizerFactory.INSTANCE;
TokenizerFactory f2 = new LowerCaseTokenizerFactory(f1);
TokenizerFactory f3 = new StopTokenizerFactory(f2,stopSet);
//I want to use my own stop lists, so I commented these lines out;
//could try EnglishStopTokenizerFactory instead here:
//TokenizerFactory f3 = new EnglishStopTokenizerFactory(f2); // with built-in stop list
///////////////////////

DisplayTokens.displayTokens(text,f3);
}

}
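To answer my own regex question above: the whole chain of StringTokenizer passes could be collapsed into a single replaceAll call over the same punctuation set. A sketch of that alternative (the extra spaces it leaves behind are harmless, since the next step tokenizes on white space anyway):

```java
public class RegexPunctuation {

    // Replace every punctuation character handled above
    // (: [ ] ' ; , ! @ $ & * ( ) - _ + =) with a space, in one pass,
    // instead of running one StringTokenizer loop per character.
    static String stripPunctuation(String text) {
        return text.replaceAll("[:\\[\\]';,!@$&*()\\-_+=]", " ");
    }

    public static void main(String[] args) {
        System.out.println(stripPunctuation("stop-list: a test, (really)!"));
    }
}
```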
