Unfortunately, TregexPattern doesn't have a quote() function yet, so I wrote a small function to escape my own strings. It is not complete, but it works for most strings found in news articles:
    /**
     * Escapes most sequences which cause problems in Tregex expressions.
     * @param unSanitizedWord The un-escaped string
     * @return The sanitized word.
     */
    private static String sanitizeWord(String unSanitizedWord) {
        String sanitizedWord = unSanitizedWord;
        boolean putSlashes = false;
        // Single-character tokens: escape them and wrap them in slashes directly.
        sanitizedWord = ((sanitizedWord.equals("(")) ? "/\\(/" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals(")")) ? "/\\)/" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals("%")) ? "/\\%/" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals("&")) ? "/\\&/" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals(":")) ? "/\\:/" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals("$")) ? "/\\$/" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals(".")) ? "/\\./" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals(",")) ? "/\\,/" : sanitizedWord);
        if (sanitizedWord.matches("^[0-9].*")) {
            // The word starts with a digit.
            putSlashes = true;
        }
        if (sanitizedWord.matches(".*[*+?].*")) {
            // The word contains regex quantifiers, which must be escaped.
            putSlashes = true;
            sanitizedWord = sanitizedWord.replaceAll("\\*", "\\\\*");
            sanitizedWord = sanitizedWord.replaceAll("\\?", "\\\\?");
            sanitizedWord = sanitizedWord.replaceAll("\\+", "\\\\+");
        }
        if (sanitizedWord.endsWith(".")) {
            // The word is an abbreviation; escape the trailing period.
            sanitizedWord = sanitizedWord.substring(0, sanitizedWord.length() - 1) + "\\.";
            putSlashes = true;
        }
        return ((putSlashes) ? "/" + sanitizedWord + "/" : sanitizedWord);
    }
I suspect that braces might cause problems too, but I have never encountered them in a news article.
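If braces or other metacharacters do turn up, one option is to escape every special character in a single pass instead of listing cases one by one. The following is only a minimal sketch of that idea, not a tested replacement for the function above. It assumes that Tregex's /.../ node patterns follow Java regex syntax and that a backslash-escaped `/` is accepted inside them; I have not verified either against Tregex itself. `TregexQuoteSketch` and `quote` are hypothetical names and are not part of Tregex.

```java
final class TregexQuoteSketch {
    // Java regex metacharacters (including braces and brackets), plus the
    // punctuation the function above treats as problematic (% & : ,).
    // Including '/' here is an assumption about how Tregex reads the delimiter.
    private static final String SPECIAL = "\\^$.|?*+()[]{}%&:,/";

    /** Backslash-escapes special characters and wraps the word in slashes when needed. */
    static String quote(String word) {
        // As in the function above, a leading ASCII digit forces the slash form.
        boolean needsSlashes = !word.isEmpty()
                && word.charAt(0) >= '0' && word.charAt(0) <= '9';
        StringBuilder escaped = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                escaped.append('\\');   // escape the special character
                needsSlashes = true;
            }
            escaped.append(c);
        }
        return needsSlashes ? "/" + escaped + "/" : escaped.toString();
    }
}
```

Unlike the function above, this variant also escapes periods in the middle of a word, so quote("U.S.") yields /U\.S\./ rather than /U.S\./, while a plain word such as "dog" is returned unchanged.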