Unfortunately, TregexPattern doesn't have a quote() function yet. Hence, I wrote a small function to quote my own strings. It is not complete, but it works for most strings found in news articles:
    /**
     * Escapes most sequences which cause problems in Tregexpressions.
     * @param unSanitizedWord The un-escaped string
     * @return The sanitized word.
     */
    private static String sanitizeWord(String unSanitizedWord) {
        String sanitizedWord = unSanitizedWord;
        boolean putSlashes = false;

        sanitizedWord = ((sanitizedWord.equals("(")) ? "/\\(/" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals(")")) ? "/\\)/" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals("%")) ? "/\\%/" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals("&")) ? "/\\&/" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals(":")) ? "/\\:/" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals("$")) ? "/\\$/" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals(".")) ? "/\\./" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals(",")) ? "/\\,/" : sanitizedWord);

        if (sanitizedWord.matches("^[0-9].*")) {
            // It contains digits in the beginning
            putSlashes = true;
        }

        if (sanitizedWord.matches(".*[*+?].*")) {
            putSlashes = true;
            sanitizedWord = sanitizedWord.replaceAll("\\*", "\\\\*");
            sanitizedWord = sanitizedWord.replaceAll("\\?", "\\\\?");
            sanitizedWord = sanitizedWord.replaceAll("\\+", "\\\\+");
        }

        if (sanitizedWord.endsWith(".")) {
            // The word is an abbreviation.
            sanitizedWord = sanitizedWord.substring(0, sanitizedWord.length() - 1) + "\\.";
            putSlashes = true;
        }

        return ((putSlashes) ? "/" + sanitizedWord + "/" : sanitizedWord);
    }
I am inclined to think that braces might cause problems too, but I have never encountered them in a news article.
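If braces and other symbols do turn up, a more general approach would be to escape every ASCII symbol instead of listing them one by one. The following is only a sketch and has not been tested against Tregex itself. The class name `TregexQuote` and the method `quote` are hypothetical, not part of the Tregex API. It assumes that a `/regex/` description is matched with `find()` semantics (hence the `^` and `$` anchors, which the function above does not add), and that Tregex accepts a backslash-escaped slash inside `/.../`, which I have not verified.

```java
public final class TregexQuote {

    private TregexQuote() {}

    /**
     * Hypothetical helper: returns a Tregex node description intended to
     * match exactly the given word. Words made only of letters and digits
     * (and not starting with an ASCII digit) are returned unchanged; anything
     * else is turned into an anchored regex with every ASCII symbol escaped.
     */
    public static String quote(String word) {
        // Assumption: an empty word or a leading ASCII digit needs the regex form.
        boolean plain = !word.isEmpty() && !isAsciiDigit(word.charAt(0));
        StringBuilder escaped = new StringBuilder();

        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c < 128 && !isAsciiLetter(c) && !isAsciiDigit(c)) {
                // java.util.regex allows a backslash before any
                // non-alphabetic character, so this is always a literal.
                escaped.append('\\');
                plain = false;
            }
            escaped.append(c);
        }

        return plain ? word : "/^" + escaped + "$/";
    }

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }
}
```

With this sketch, `quote("dog")` stays `dog`, `quote("U.S.")` becomes `/^U\.S\.$/`, and `quote("{")` becomes `/^\{$/`. The result would then be concatenated into the pattern string passed to `TregexPattern.compile`.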