I stumbled upon one entry where the regular view looks like proper paragraphs and sentences, but when I proceed to correct it, about 95% of the entry collapses into a single entry field. How does that happen, and, more importantly, how can I work with it sensibly?
If I remember correctly, it's either a limitation of the NLTK sentence tokenization, or the full stops (periods) inside the parentheses are throwing off the script that splits sentences.
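If you want to see what I mean, here is a quick sketch (the German sample sentence is just something I made up for illustration) that shows what NLTK's tokenizer does with full stops inside parentheses:

```python
import nltk

nltk.download('punkt', quiet=True)  # Punkt models; newer NLTK versions may also need 'punkt_tab'

# Made-up example: two full sentences inside parentheses, then one more after them.
text = 'Ich habe das Buch gelesen. (Es war sehr spannend. Wirklich.) Danach war ich zufrieden.'

for sentence in nltk.sent_tokenize(text, language='german'):
    print(repr(sentence))
```

On text like this the tokenizer will typically break inside the parentheses as well, so the pieces have to be stitched back together afterwards, and that is where the splitting can go wrong.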
It's honestly pretty hard to cover all the edge cases, so if someone who is more knowledgeable in NLP than me wants to give it a shot, let me know! I'll provide the entire Python script that splits the sentences, and I'll be more than happy to give credit ^^
There's honestly not much you can do without the original author fixing it by editing their post. There's also a sentence-split preview, which they apparently didn't check before submitting. It would just have to be corrected as is, unfortunately.
If you point me at the code I'm happy to have a look, not that I'm a Python expert. If I stare at it long enough (and it has some reasonable comments) I may be able to understand it and possibly even fix it.
Okay, I tried to attach the entire utils.py file, but since the extension is on Discourse's list of disallowed extensions, I will just copy and paste the code below. However, if you would like the file instead, please let me know and I will send it to you via email. I have your email on file, so if you want to go that route, please don't write your email here; I don't want personal information to be visible on the platform.
Here's the code (you should be able to select all the text in this box, copy it, and then paste it into a .py file):
def split_sentences_east_asian(text):
    # Abbreviations that should not end a sentence.
    cases = ['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'St.', 'Jr.', 'Sr.', 'P.', 'S.',
             'Jan.', 'Feb.', 'Mar.', 'Apr.', 'May.', 'Jun.', 'Jul.', 'Aug.', 'Sep.', 'Oct.', 'Nov.', 'Dec.']
    post_text = find_urls(text)  # find_urls() lives elsewhere in utils.py; it wraps URLs in <<url>> markers
    post_text = post_text.strip()
    sentences = []
    double_quote = 0
    parenthesis = 0
    j_p = 0  # depth inside CJK corner brackets
    bracket = 0
    url = 0

    def exit_dot(text, pos):
        # Returns True if the character at pos really ends a sentence.
        if j_p != 0 or double_quote != 0 or parenthesis != 0 or bracket != 0 or url != 0 or \
                pos < len(text) - 1 and text[pos - 1].isdigit() and text[pos + 1].isdigit():
            return False
        if text[pos] in ['.', '。'] and pos < len(text) - 1 and text[pos + 1] in ['.', '。']:
            return False
        if pos != 0 and text[pos - 1] in ['.', '。'] and text[pos] in ['.', '。'] and pos < len(text) - 1 and \
                text[pos + 1] != ' ':
            return False
        if text[pos] in ['.', '。'] and pos >= 2 and text[pos - 1].isdigit() and text[pos - 2].isupper():  # Q1. xxx
            return False
        if text[pos] in ['.', '。'] and pos >= 1 and text[pos - 1].isdigit():  # 1. xxx
            return False
        # Check for special cases (abbreviations)
        for i in range(len(cases)):
            case = cases[i]
            if len(case) <= pos + 1 and case == text[pos - len(case) + 1:pos + 1]:
                return False
        return True

    sub_sentence = ''
    for idx in range(len(post_text)):
        sub_sentence += post_text[idx]
        if post_text[idx] == '「':
            j_p += 1
        if post_text[idx] == '」':
            j_p -= 1
        if post_text[idx] in ['"', '“', '”', '„', '〝']:  # toggle on double-quote characters
            double_quote = 1 - double_quote
        if post_text[idx] == '(' or post_text[idx] == '（':
            if idx == 0 or (idx and post_text[idx - 1] != ':'):  # ignore the :( emoticon
                parenthesis += 1
        if post_text[idx] == ')' or post_text[idx] == '）':
            if idx and post_text[idx - 1] != ':':  # ignore the :) emoticon
                parenthesis -= 1
        if post_text[idx] == '[':
            bracket += 1
        if post_text[idx] == ']':
            bracket -= 1
        if post_text[idx] == '<' and idx + 6 < len(post_text) and post_text[idx: idx + 7] == '<<url>>':
            url = 1 - url
        if idx != 0 and post_text[idx] in ['\n', '\r', '.', '!', '?', '。', '！', '？', '؟'] and exit_dot(post_text, idx):
            sub_sentence = sub_sentence.strip().replace('<<url>>', '')
            if len(sub_sentence) and sub_sentence not in ['===', '---']:
                sentences.append(sub_sentence)
            sub_sentence = ''
        idx += 1
    if sub_sentence:
        sub_sentence = sub_sentence.strip().replace('<<url>>', '')
        if len(sub_sentence) and sub_sentence not in ['===', '---']:
            sentences.append(sub_sentence)
    return sentences
def split_sentences(text, language):
    import nltk
    import re

    # Languages with a Punkt model shipped with NLTK.
    nltk_support_languages = {
        'Czech': 'czech',
        'Danish': 'danish',
        'Dutch': 'dutch',
        'English': 'english',
        'Estonian': 'estonian',
        'Finnish': 'finnish',
        'French': 'french',
        'German': 'german',
        'Greek': 'greek',
        'Italian': 'italian',
        'Norwegian': 'norwegian',
        'Polish': 'polish',
        'Portuguese': 'portuguese',
        'Russian': 'russian',
        # 'Slovene': 'slovene',  # exists in NLTK, but not in the language model
        'Spanish': 'spanish',
        'Swedish': 'swedish',
        'Turkish': 'turkish',
    }
    name_re = re.compile('[a-zA-Z][.][a-zA-Z][.]')  # initials such as "J.R."
    temp = []
    # Mark the original line breaks so they can be stripped out again later.
    post_text = text.replace('\n', '<<<end>>>\n')
    post_text = post_text.replace('\r', '<<<end>>>\r')
    post_text = post_text.replace('\n\r', '<<<end>>>\n\r')
    post_text = post_text.replace('\r\n', '<<<end>>>\r\n')
    for a in post_text.split('\n'):
        if language in nltk_support_languages:
            temp += nltk.sent_tokenize(a, nltk_support_languages[language])
        else:
            temp += nltk.sent_tokenize(a)
    sentences = []
    sentence = ""
    parenthesis = 0
    # Re-join tokenizer output that should stay together (open parentheses, numbering, initials, ...).
    for idx, a in enumerate(temp):
        for i in range(len(a)):
            if a[i] == '(' and (i != 0 and a[i - 1] != ':' or i == 0):  # ignore the :( emoticon
                parenthesis += 1
            if a[i] == ')' and (i != 0 and a[i - 1] != ':' or i == 0):  # ignore the :) emoticon
                parenthesis -= 1
        sentence = sentence + ('' if len(sentence) == 0 or (sentence[-1] == '!' and a[0] == '!') else ' ') + a
        if name_re.match(a[-4:]):
            continue
        if len(a) > 1 and a[0].isdigit() and a[1] == '.':  # 1. xxx
            continue
        if len(a) > 2 and a[0].isalpha() and a[1].isdigit() and a[2] == '.':  # Q1. xxx
            continue
        if len(a) > 3 and a[0].isalpha() and a[1].isdigit() and a[2].isdigit() and a[3] == '.':  # Q10. xxx
            continue
        sentence = sentence.replace('<<<end>>>', '')
        if parenthesis < 0 and sentence.strip():
            sentences.append(sentence.strip())
            sentence = ""
        if a[len(a) - 1] == ')':
            continue
        if sentence and sentence[-1] == '!' and idx < len(temp) - 1 and temp[idx + 1][0] == '!':
            continue
        if parenthesis == 0 and sentence.strip():
            sentences.append(sentence.strip())
            sentence = ""
    if len(sentence) and sentence.strip():
        sentences.append(sentence.strip())
    return sentences
if __name__ == '__main__':
    content = '''
(Sample Korean journal text for testing; replace with the text you want to split.)
'''
    result = split_sentences_east_asian(content)
    for a in result:
        print(a)
Notes
Since the problematic journal is written in German, you will need to make the following changes (there is a sketch of the modified block after this list):

- Copy the problematic journal text and paste it into the `content` variable found under `if __name__ == '__main__':`. It's at the very bottom; you will see the Korean sample text there.
- In the same `if` block, where the `result` variable is assigned, change the function call to `split_sentences` instead of `result = split_sentences_east_asian(content)`. Note that `split_sentences` also takes the language name as a second argument.
- In your environment, you will need to install NLTK. You can find instructions here: NLTK :: Installing NLTK.
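For reference, here is a rough sketch of what the bottom of the file would look like after those changes; the `'German'` argument and the placeholder string are my additions, so substitute the real journal text:

```python
# Sketch of the modified __main__ block, assuming the journal is in German.
if __name__ == '__main__':
    content = '''
Paste the problematic German journal text here.
'''
    # split_sentences() expects the language name as its second argument,
    # and 'German' is one of the keys in nltk_support_languages.
    result = split_sentences(content, 'German')
    for a in result:
        print(a)
```

Installing NLTK itself is usually just `pip install nltk`; on first run you may also be asked to download the Punkt sentence models, which you can do with `nltk.download('punkt')`.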
If you have any questions or problems running the code, let me know and I can help.
Thank you, I'll have a look-see when I get home.