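OpenNLP - Sentence Detection

While processing natural language, deciding where sentences begin and end is one of the problems to be addressed. This process is known as Sentence Boundary Disambiguation (SBD), or simply sentence breaking.

The techniques we use to detect the sentences in a given text depend on the language of the text.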

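Sentence Detection Using Java

We can detect the sentences in a given text in Java using regular expressions and a set of simple rules.

For example, if we assume that a sentence in the given text ends with a period, a question mark, or an exclamation mark, we can split the text into sentences using the split() method of the String class. Here, we have to pass the regular expression as a String.

Following is a program that determines the sentences in a given text using a Java regular expression (the split() method). Save this program in a file named SentenceDetection_RE.java.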

public class SentenceDetection_RE {  
   public static void main(String args[]){ 
     
      String sentence = " Hi. How are you? Welcome to Tutorialspoint. " 
         + "We provide free tutorials on various technologies"; 
     
      String simple = "[.?!]";      
      String[] splitString = (sentence.split(simple));     
      for (String string : splitString)   
         System.out.println(string);      
   } 
}

Compile and execute the saved Java file from the command prompt using the following commands.

javac SentenceDetection_RE.java 
java SentenceDetection_RE

On executing, the above program prints the following output to the console.

Hi 
How are you 
Welcome to Tutorialspoint 
We provide free tutorials on various technologies
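
The plain split() call above drops the terminating punctuation and leaves any whitespace that followed a terminator at the start of the next piece. As a minimal sketch (not part of the original program; the class name RegexSentenceSplit and the second sample string are assumptions made for illustration), a lookbehind pattern can split on the whitespace after a terminator instead, so each sentence keeps its punctuation. The second call shows why such simple rules are fragile: an abbreviation like "Dr." is treated as the end of a sentence.

```java
import java.util.Arrays;
import java.util.List;

public class RegexSentenceSplit {

    // Split on whitespace that directly follows '.', '?' or '!'.
    // The lookbehind (?<=[.?!]) matches without consuming the terminator,
    // so each sentence keeps its punctuation.
    static List<String> split(String text) {
        return Arrays.asList(text.trim().split("(?<=[.?!])\\s+"));
    }

    public static void main(String[] args) {
        String text = "Hi. How are you? Welcome to Tutorialspoint. "
            + "We provide free tutorials on various technologies";

        for (String s : split(text)) {
            System.out.println(s);
        }

        // A rule this simple also breaks after abbreviations:
        // "Dr." becomes a sentence of its own.
        System.out.println(split("Dr. Smith is here. He is early."));
    }
}
```

This kind of false split is the reason a trained model, as described in the next section, is usually preferred over hand-written rules.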

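Sentence Detection Using OpenNLP

To detect sentences, OpenNLP uses a predefined model, a file named en-sent.bin. This predefined model is trained to detect sentences in a given raw text.

The opennlp.tools.sentdetect package contains the classes and interfaces that are used to perform the sentence detection task.

To detect sentences using the OpenNLP library, you need to:

Load the en-sent.bin model using the SentenceModel class.
Instantiate the SentenceDetectorME class.
Detect the sentences using the sentDetect() method of this class.

Following are the steps to be followed to write a program that detects the sentences in a given raw text.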

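Step 1: Loading the Model

The model for sentence detection is represented by the class named SentenceModel, which belongs to the package opennlp.tools.sentdetect.

To load a sentence detection model:

Create an InputStream object of the model (instantiate FileInputStream and pass the path of the model, in String format, to its constructor).
Instantiate the SentenceModel class and pass the InputStream (object) of the model as a parameter to its constructor, as shown in the following code block.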

//Loading sentence detector model 
InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
SentenceModel model = new SentenceModel(inputStream);
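
A wrong model path (for example, a misspelled file name) only shows up as an exception at run time. The following is a minimal sketch, under stated assumptions, of opening the model file defensively: ModelFileCheck and openModel are hypothetical names introduced here, and the SentenceModel construction appears only as a comment so that the sketch compiles without the OpenNLP jar on the classpath.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ModelFileCheck {

    // Hypothetical helper: opens the model file and fails early
    // with a clear message if the path does not point to a file.
    static InputStream openModel(String modelPath) throws IOException {
        Path path = Paths.get(modelPath);
        if (!Files.isRegularFile(path)) {
            throw new FileNotFoundException(
                "Model file not found: " + path.toAbsolutePath());
        }
        return Files.newInputStream(path);
    }

    public static void main(String[] args) {
        // The default path is the one used throughout this chapter.
        String modelPath = args.length > 0
            ? args[0] : "C:/OpenNLP_models/en-sent.bin";

        // try-with-resources closes the stream even if loading fails.
        try (InputStream in = openModel(modelPath)) {
            // With OpenNLP available, the model would be created here:
            // SentenceModel model = new SentenceModel(in);
            System.out.println("Model file opened: " + modelPath);
        } catch (IOException e) {
            System.out.println("Could not open model: " + e.getMessage());
        }
    }
}
```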

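Step 2: Instantiating the SentenceDetectorME Class

The SentenceDetectorME class of the package opennlp.tools.sentdetect contains methods to split raw text into sentences. This class uses a maximum entropy model to evaluate the end-of-sentence characters in a string and decide whether they signify the end of a sentence.

Instantiate this class and pass the model object created in the previous step, as shown below.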

//Instantiating the SentenceDetectorME class 
SentenceDetectorME detector = new SentenceDetectorME(model);

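Step 3: Detecting the Sentences

The sentDetect() method of the SentenceDetectorME class is used to detect the sentences in the raw text passed to it. This method accepts the text as a String and returns an array of Strings, one per detected sentence.

Invoke this method by passing the raw text to it as a String.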

//Detecting the sentence 
String sentences[] = detector.sentDetect(sentence);

Example

Following is the program which detects the sentences in a given raw text. Save this program in a file named SentenceDetectionME.java.

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel;  

public class SentenceDetectionME { 
  
   public static void main(String args[]) throws Exception { 
   
      String sentence = "Hi. How are you? Welcome to Tutorialspoint. " 
         + "We provide free tutorials on various technologies"; 
       
      //Loading sentence detector model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
      SentenceModel model = new SentenceModel(inputStream); 
       
      //Instantiating the SentenceDetectorME class 
      SentenceDetectorME detector = new SentenceDetectorME(model);  
    
      //Detecting the sentence
      String sentences[] = detector.sentDetect(sentence); 
    
      //Printing the sentences 
      for(String sent : sentences)        
         System.out.println(sent);  
   } 
}

Compile and execute the saved Java file from the command prompt using the following commands.

javac SentenceDetectionME.java 
java SentenceDetectionME

On executing, the above program reads the given String, detects the sentences in it, and displays the following output.

Hi. How are you? 
Welcome to Tutorialspoint. 
We provide free tutorials on various technologies

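Detecting the Positions of the Sentences

We can also detect the positions of the sentences using the sentPosDetect() method of the SentenceDetectorME class.

Following are the steps to be followed to write a program that detects the positions of the sentences in a given raw text.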

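Step 1: Loading the Model

The model for sentence detection is represented by the class named SentenceModel, which belongs to the package opennlp.tools.sentdetect.

To load a sentence detection model:

Create an InputStream object of the model (instantiate FileInputStream and pass the path of the model, in String format, to its constructor).
Instantiate the SentenceModel class and pass the InputStream (object) of the model as a parameter to its constructor, as shown in the following code block.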

//Loading sentence detector model 
InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
SentenceModel model = new SentenceModel(inputStream);

Step 2: Instantiating the SentenceDetectorME Class

Instantiate this class and pass the model object created in the previous step.

//Instantiating the SentenceDetectorME class 
SentenceDetectorME detector = new SentenceDetectorME(model);

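Step 3: Detecting the Positions of the Sentences

The sentPosDetect() method of the SentenceDetectorME class is used to detect the positions of the sentences in the raw text passed to it. This method accepts the text as a String.

Invoke this method by passing the raw text to it as a String.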

//Detecting the position of the sentences in the paragraph  
Span[] spans = detector.sentPosDetect(sentence);

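Step 4: Printing the Spans of the Sentences

The sentPosDetect() method of the SentenceDetectorME class returns an array of objects of the type Span. The class named Span of the opennlp.tools.util package stores the start and end offsets (integers) of a detected span.

You can store the spans returned by the sentPosDetect() method in a Span array and print them, as shown in the following code block.

//Printing the spans of the sentences 
for (Span span : spans)         
   System.out.println(span);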

Example

Following is the program which detects the positions of the sentences in a given raw text. Save this program in a file named SentencePosDetection.java.

import java.io.FileInputStream; 
import java.io.InputStream; 
  
import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel; 
import opennlp.tools.util.Span;

public class SentencePosDetection { 
  
   public static void main(String args[]) throws Exception { 
   
      String paragraph = "Hi. How are you? Welcome to Tutorialspoint. " 
         + "We provide free tutorials on various technologies"; 
       
      //Loading sentence detector model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
      SentenceModel model = new SentenceModel(inputStream); 
       
      //Instantiating the SentenceDetectorME class 
      SentenceDetectorME detector = new SentenceDetectorME(model);  
       
      //Detecting the position of the sentences in the raw text 
      Span spans[] = detector.sentPosDetect(paragraph); 
       
      //Printing the spans of the sentences in the paragraph 
      for (Span span : spans)         
         System.out.println(span);  
   } 
}

Compile and execute the saved Java file from the command prompt using the following commands.

javac SentencePosDetection.java 
java SentencePosDetection

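On executing, the above program reads the given String, detects the positions of the sentences in it, and displays the following output.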

[0..16) 
[17..43) 
[44..93)

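Sentences along with their Positions

The substring() method of the String class accepts the begin and end offsets and returns the respective string. We can use this method to print the sentences and their spans (positions) together, as shown in the following code block.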

for (Span span : spans)         
   System.out.println(sen.substring(span.getStart(), span.getEnd())+" "+ span);

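Following is the program which detects the sentences in a given raw text and displays them along with their positions. Save this program in a file named SentencesAndPosDetection.java.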

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel; 
import opennlp.tools.util.Span; 
   
public class SentencesAndPosDetection { 
  
   public static void main(String args[]) throws Exception { 
     
      String sen = "Hi. How are you? Welcome to Tutorialspoint." 
         + " We provide free tutorials on various technologies"; 
      //Loading a sentence model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
      SentenceModel model = new SentenceModel(inputStream); 
       
      //Instantiating the SentenceDetectorME class 
      SentenceDetectorME detector = new SentenceDetectorME(model);  
       
      //Detecting the position of the sentences in the paragraph  
      Span[] spans = detector.sentPosDetect(sen);  
      
      //Printing the sentences and their spans of a paragraph 
      for (Span span : spans)         
         System.out.println(sen.substring(span.getStart(), span.getEnd())+" "+ span);  
   } 
}

Compile and execute the saved Java file from the command prompt using the following commands.

javac SentencesAndPosDetection.java 
java SentencesAndPosDetection

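On executing, the above program reads the given String, detects the sentences along with their positions, and displays the following output.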

Hi. How are you? [0..16) 
Welcome to Tutorialspoint. [17..43)  
We provide free tutorials on various technologies [44..93)
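
The [start..end) notation in this output denotes a half-open range: the start offset is included and the end offset is excluded, which is the same convention String.substring() uses. The sketch below illustrates this with plain Java only; the class name SpanOffsets and the helper covered() are hypothetical, and the offsets are copied by hand from the output above rather than computed by OpenNLP.

```java
public class SpanOffsets {

    // Returns the text covered by the half-open range [start, end):
    // start is included, end is excluded, as with String.substring().
    static String covered(String text, int start, int end) {
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String sen = "Hi. How are you? Welcome to Tutorialspoint."
            + " We provide free tutorials on various technologies";

        // Offsets copied from the output shown above.
        int[][] offsets = { {0, 16}, {17, 43}, {44, 93} };

        for (int[] o : offsets) {
            String s = covered(sen, o[0], o[1]);
            // The length of a half-open range is simply end - start.
            System.out.println("[" + o[0] + ".." + o[1] + ") length "
                + (o[1] - o[0]) + ": " + s);
        }
    }
}
```

Note that the characters at offsets 16 and 43 are the spaces between sentences; they belong to no span, which is why consecutive spans do not touch.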

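Sentence Probability Detection

The getSentenceProbabilities() method of the SentenceDetectorME class returns the probabilities associated with the most recent call to the sentDetect() method.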

//Getting the probabilities of the last decoded sequence       
double[] probs = detector.getSentenceProbabilities();

Following is the program which prints the probabilities associated with a call to the sentDetect() method. Save this program in a file named SentenceDetectionMEProbs.java.

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel;  

public class SentenceDetectionMEProbs { 
  
   public static void main(String args[]) throws Exception { 
   
      String sentence = "Hi. How are you? Welcome to Tutorialspoint. " 
         + "We provide free tutorials on various technologies"; 
       
      //Loading sentence detector model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin");
      SentenceModel model = new SentenceModel(inputStream); 
       
      //Instantiating the SentenceDetectorME class
      SentenceDetectorME detector = new SentenceDetectorME(model);  
      
      //Detecting the sentence 
      String sentences[] = detector.sentDetect(sentence); 
    
      //Printing the sentences 
      for(String sent : sentences)        
         System.out.println(sent);   
         
      //Getting the probabilities of the last decoded sequence       
      double[] probs = detector.getSentenceProbabilities(); 
       
      System.out.println("  "); 
       
      for(int i = 0; i<probs.length; i++) 
         System.out.println(probs[i]); 
   } 
}

Compile and execute the saved Java file from the command prompt using the following commands.

javac SentenceDetectionMEProbs.java 
java SentenceDetectionMEProbs

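On executing, the above program reads the given String, detects the sentences, and prints them. In addition, it prints the probabilities associated with the most recent call to the sentDetect() method, as shown below.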

Hi. How are you? 
Welcome to Tutorialspoint. 
We provide free tutorials on various technologies 
   
0.9240246995179983 
0.9957680129995953 
1.0
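
One possible use of these values is to flag sentences whose boundary the model was less sure about. The following is only a sketch: the class name LowConfidenceSentences, the helper below(), and the 0.95 threshold are assumptions made for illustration, and the sentences and probabilities are copied by hand from the output above instead of being produced by OpenNLP. It also assumes that probs[i] belongs to sentences[i].

```java
import java.util.ArrayList;
import java.util.List;

public class LowConfidenceSentences {

    // Returns the sentences whose probability is below the threshold.
    // Assumes sentences[i] and probs[i] refer to the same sentence.
    static List<String> below(String[] sentences, double[] probs,
                              double threshold) {
        List<String> flagged = new ArrayList<>();
        for (int i = 0; i < sentences.length && i < probs.length; i++) {
            if (probs[i] < threshold) {
                flagged.add(sentences[i]);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Values copied from the output shown above.
        String[] sentences = {
            "Hi. How are you?",
            "Welcome to Tutorialspoint.",
            "We provide free tutorials on various technologies"
        };
        double[] probs = { 0.9240246995179983, 0.9957680129995953, 1.0 };

        // 0.95 is an arbitrary threshold chosen for this illustration.
        for (String s : below(sentences, probs, 0.95)) {
            System.out.println("Low confidence: " + s);
        }
    }
}
```

With these values only the first sentence falls below the threshold, which is also the one where the detector did not split after "Hi.".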