Parsing oddity / text not being split into sentences in correction view

I stumbled upon one entry where the regular view shows proper paragraphs and sentences, but when I proceed to correct it, 95% of the entry collapses into a single entry field. How does that happen, and, more importantly, how can I work with it sensibly?


If I remember correctly, it's either a limitation of NLTK's sentence tokenization, or the stops (periods) inside the parentheses are throwing off the script that splits sentences.
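
To give a rough idea of the second possibility (this is just a quick throwaway sketch, not the code the site actually runs), you can feed a paragraph with a bracketed aside straight to NLTK and look at where it decides the sentence boundaries are:

import nltk

nltk.download('punkt')  # tokenizer data; newer NLTK versions may ask for 'punkt_tab' instead

sample = ("I finished the book yesterday. "
          "(It took me a week, i.e. much longer than usual. It was worth it.) "
          "Now I want to read the sequel.")

# Print each chunk the tokenizer produces, one per line, to see whether the
# periods inside the parentheses end up splitting where you'd expect.
for chunk in nltk.sent_tokenize(sample):
    print(repr(chunk))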

It's honestly pretty hard to cover all of the edge cases, so if someone who is more knowledgeable in NLP than me wants to give it a shot, let me know! I'll provide the entire Python script that splits the sentences, and I'll be more than happy to give credit ^^

There's honestly not much you can do without the original author fixing it by editing their post. There's also a sentence-split preview, which they evidently didn't check before submitting. It would just have to be corrected as is, unfortunately :frowning:

If you point me at the code I'm happy to have a look, not that I'm a Python expert. If I stare at it long enough (and it has some reasonable comments) I may be able to understand it and possibly even fix it :wink:

Okay, I tried to attach the entire utils.py file, but since the extension is on Discourse's list of disallowed extensions, I will just copy and paste the code below. However, if you would like the file instead, please let me know and I will send it to you via email. I have your email on file, so if you want to go that route, please don't write your email here; I don't want personal information to be visible on the platform.

Here's the code (you should be able to select all the text in this box, copy it, and then paste it into a .py file):

def split_sentences_east_asian(text):
    # Common abbreviations that end in a period but should not end a sentence.
    cases = ['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'St.', 'Jr.', 'Sr.', 'P.', 'S.',
             'Jan.', 'Feb.', 'Mar.', 'Apr.', 'May.', 'Jun.', 'Jul.', 'Aug.', 'Sep.', 'Oct.', 'Nov.', 'Dec.']
    # find_urls() is defined elsewhere in utils.py (not part of this paste);
    # judging by the code below, it wraps URLs in <<url>> markers.
    post_text = find_urls(text)
    post_text = post_text.strip()
    sentences = []
    # Counters for contexts in which punctuation should not end a sentence.
    double_quote = 0
    parenthesis = 0
    j_p = 0  # depth of 「 」 corner brackets
    bracket = 0
    url = 0

    def exit_dot(text, pos):
        # Returns True if the punctuation mark at `pos` should end the current sentence.
        if j_p != 0 or double_quote != 0 or parenthesis != 0 or bracket != 0 or url != 0 or \
                pos < len(text) - 1 and text[pos - 1].isdigit() and text[pos + 1].isdigit():
            return False
        if text[pos] in ['.', '。'] and pos < len(text) - 1 and text[pos + 1] in ['.', '。']:
            return False
        if pos != 0 and text[pos - 1] in ['.', '。'] and text[pos] in ['.', '。'] and pos < len(text) - 1 and \
                text[pos + 1] != ' ':
            return False
        if text[pos] in ['.', '。'] and pos >= 2 and text[pos - 1].isdigit() and text[pos - 2].isupper():  # Q1. xxx
            return False
        if text[pos] in ['.', '。'] and pos >= 1 and text[pos - 1].isdigit():  # 1. xxx
            return False
        # Check for special cases (the abbreviations above)
        for i in range(len(cases)):
            case = cases[i]
            if len(case) <= pos + 1 and case == text[pos - len(case) + 1:pos + 1]:
                return False
        return True

    sub_sentence = ''
    for idx in range(len(post_text)):
        sub_sentence += post_text[idx]
        if post_text[idx] == '「':
            j_p += 1
        if post_text[idx] == '」':
            j_p -= 1
        if post_text[idx] in ['"', '“', '”', '„', '〃']:
            double_quote = 1 - double_quote
        if post_text[idx] == '(' or post_text[idx] == '（':
            if idx == 0 or (idx and post_text[idx - 1] != ':'):  # don't count a :( smiley
                parenthesis += 1
        if post_text[idx] == ')' or post_text[idx] == '）':
            if idx and post_text[idx - 1] != ':':  # don't count a :) smiley
                parenthesis -= 1
        if post_text[idx] == '[':
            bracket += 1
        if post_text[idx] == ']':
            bracket -= 1
        if post_text[idx] == '<' and idx + 6 < len(post_text) and post_text[idx: idx + 7] == '<<url>>':
            url = 1 - url
        if idx != 0 and post_text[idx] in ['\n', '\r', '.', '!', '?', '。', '！', '？', '؟'] and exit_dot(post_text, idx):
            sub_sentence = sub_sentence.strip().replace('<<url>>', '')
            if len(sub_sentence) and sub_sentence not in ['===', '---']:
                sentences.append(sub_sentence)
            sub_sentence = ''
        idx += 1
    if sub_sentence:
        sub_sentence = sub_sentence.strip().replace('<<url>>', '')
        if len(sub_sentence) and sub_sentence not in ['===', '---']:
            sentences.append(sub_sentence)
    return sentences

def split_sentences(text, language):
    import nltk
    import re
    # Maps the site's language names to the tokenizer names NLTK expects.
    nltk_support_languages = {
        'Czech': 'czech',
        'Danish': 'danish',
        'Dutch': 'dutch',
        'English': 'english',
        'Estonian': 'estonian',
        'Finnish': 'finnish',
        'French': 'french',
        'German': 'german',
        'Greek': 'greek',
        'Italian': 'italian',
        'Norwegian': 'norwegian',
        'Polish': 'polish',
        'Portuguese': 'portuguese',
        'Russian': 'russian',
        # 'Slovene': 'slovene',  # supported by nltk, but not in the language model
        'Spanish': 'spanish',
        'Swedish': 'swedish',
        'Turkish': 'turkish',
    }
    name_re = re.compile('[a-zA-Z][.][a-zA-Z][.]')  # initials such as "J.R."
    temp = []
    # Mark the original line breaks so they can be stripped back out later.
    post_text = text.replace('\n', '<<<end>>>\n')
    post_text = post_text.replace('\r', '<<<end>>>\r')
    post_text = post_text.replace('\n\r', '<<<end>>>\n\r')
    post_text = post_text.replace('\r\n', '<<<end>>>\r\n')
    for a in post_text.split('\n'):
        if language in nltk_support_languages:
            temp += nltk.sent_tokenize(a, nltk_support_languages[language])
        else:
            temp += nltk.sent_tokenize(a)
    # Re-join chunks that NLTK split too eagerly (open parentheses, numbered
    # list items, initials, "!!" runs) before emitting the final sentences.
    sentences = []
    sentence = ""
    parenthesis = 0
    for idx, a in enumerate(temp):
        for i in range(len(a)):
            if a[i] == '(' and (i != 0 and a[i - 1] != ':' or i == 0):  # don't count a :( smiley
                parenthesis += 1
            if a[i] == ')' and (i != 0 and a[i - 1] != ':' or i == 0):  # don't count a :) smiley
                parenthesis -= 1
        sentence = sentence + ('' if len(sentence) == 0 or (sentence[-1] == '!' and a[0] == '!') else ' ') + a
        if name_re.match(a[-4:]):
            continue
        if len(a) > 1 and a[0].isdigit() and a[1] == '.':  # 1. xxx
            continue
        if len(a) > 2 and a[0].isalpha() and a[1].isdigit() and a[2] == '.':  # Q1. xxx
            continue
        if len(a) > 3 and a[0].isalpha() and a[1].isdigit() and a[2].isdigit() and a[3] == '.':  # Q10. xxx
            continue
        sentence = sentence.replace('<<<end>>>', '')
        if parenthesis < 0 and sentence.strip():
            sentences.append(sentence.strip())
            sentence = ""
        if a[len(a) - 1] == ')':
            continue
        if sentence and sentence[-1] == '!' and idx < len(temp) - 1 and temp[idx + 1][0] == '!':
            continue
        if parenthesis == 0 and sentence.strip():
            sentences.append(sentence.strip())
            sentence = ""
    if len(sentence) and sentence.strip():
        sentences.append(sentence.strip())
    return sentences

if __name__ == '__main__':
    content = '''
방금 "미들게임"이라는 책을 다 읽었어요. ("미들게임"이라는 제목 "중간 부분 게임"이 뜻이예요.) 책을 읽으면서, 이야기가 신나는 것을 같았어요. 계속 행복하게 읽었어요.

하지만 책을 다 읽는 것을 후에, 실망했어요. 저는 "그것이는 다예요?"라고 생각했어요. 이야기는 불완전 것을 같았어요. "미들게임"이라는 제목은 아이러니하는데...책의 중간 부분만 좋았어요. ㅋㅋ

(진짜, 저는 너무 심하게 비평하고 있어요. "미들게임" 안 나빴어요. 저는 이 저자를 좋아하고 저자의 책을 더 읽을 거에요.)
    '''
    result = split_sentences_east_asian(content)
    for a in result:
        print(a)

Notes

Since the problematic journal is written in German, you will need to make the following changes:

  1. Copy the problematic journal text and paste it into the content variable found under if __name__ == '__main__':. It's at the very bottom; right now it holds Korean sample text.

  2. In the same if block, where the result variable is assigned, change the function call from result = split_sentences_east_asian(content) to split_sentences, which also takes the language name as a second argument (see the sketch after this list).

  3. In your environment, you will need to install NLTK. You can find instructions here: NLTK :: Installing NLTK.
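
Putting steps 1 and 2 together, the bottom of the file would end up looking roughly like this (the German sentence below is just a stand-in; paste the real journal text in its place):

if __name__ == '__main__':
    content = '''
Gestern habe ich ein interessantes Buch gelesen. (Es war ca. 300 Seiten lang.) Ich werde es bestimmt noch einmal lesen.
    '''
    # split_sentences() expects the capitalized key from nltk_support_languages
    # ('German'), which it maps to NLTK's 'german' tokenizer internally.
    result = split_sentences(content, 'German')
    for a in result:
        print(a)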

If you have any questions or problems running the code, let me know and I can help.

Thank you, I'll have a looksee when I get home.