Dynamic Programming - DPV 6.4: an explanation for the sentence with corrupted text problem|downey.io

This was an ungraded practice problem for our CS6515 class. There are existing solutions out there, but I found them a bit unintuitive so this is my attempt at explaining the problem a little differently.

Problem Summary

You are given a string s containing n characters s[1…n]. The string is corrupted such that all punctuation has been removed. For example, the string “you are a bold one.” would appear as “youareaboldone”. You are also given a dictionary method, dict(w), that is able to determine whether or not a substring w of s is a valid word. In other words, dict(“bold”) would return true, while dict(“aboldo”) would return false. You are tasked with doing the following:

Give a dynamic programming algorithm that can tell whether or not a string s contains a sentence of valid words. This algorithm should take at most O(n²) time.
If the string is valid, output the sequence of words.

Refer to page 178 of the textbook for the full, formalized problem description.

Solution

I’m going to go about solving this problem similarly to my knapsack problem explanation and do the following:

State the subproblems
Define the recurrence
Describe the memoization table structure
Whip up some pseudocode
Perform some quick Big O analysis of the algorithm

Subproblem

Dynamic programming is all about identifying smaller subproblems, solving them, and using them to build up to the final solution. For this problem we can build off of smaller prefixes of s that are of length i where 0 ≤ i ≤ n.

Let’s define our subproblem as:

V(i) = whether or not (true/false) the string s contains only valid words that exist in the dictionary in the string s₁…s_i

Recurrence

The substring of length 0 (empty string) is considered a valid word so our base case starts with the sentence being valid.

Base Case: V(0) = true

Since we need to check the status of substrings within s₁…s_i in our dictionary we will need to introduce another variable j such that 0 ≤ i ≤ i. Our recurrence will look as follows:

Recurrence: V(i) = any( for(j in 1…i) { dict(s[j…i]) AND V(j - 1) } )

So what is this saying? Basically for all characters in s from 1 to i we will loop over the possible substrings using j. If the substring of s from j to i is a valid word in the dictionary and we recorded the previous word was valid by checking the ending of the previous substring via V(j - 1).

Memoization Table Structure

Given that our subproblem definition takes just a single parameter, i, we should be able to get away with a 1-dimensional memoization table of length to contain partial values for strings of length 0 to n.

Pseudocode Implementation

I decided to implement this one in Python instead of just pseudocode. You should be able to run the following on Repl.It using their Python 3 interpreter.

Part 1 - Is the Sentence Valid?

# Base case
t[0] = True

# Recurrence
for i in range(1, n):
  t[i] = False
  for j in range(1, i+1):
    # i and j correspond to our memoization array t
    # which contains an extra element to represent the
    # empty substring. We must adjust them to work with
    # the s character array.
    si = i - 1
    sj = j - 1

    substr = "".join(s[sj:si+1])

    if d[substr] and t[j-1]:
      t[i] = True
      next_word_idx[j-1] = i

print("The string s contains a valid sentence:")
print(t[n-1])

Part 2 - Output the Words

# Extract the words back out
prev = 0
words = []
for i in next_word_idx:
  if i != 0:
    word = s[prev:i]
    words.append("".join(word))
    prev = i

print("")
print("The words in the sentence are:")
print(words)

Runnable Code

You can see the code in action in the following Python repl:

Big O Analysis

As you can see in the Python pseudocode above, most of the work happens in Part 1. Here we have two nested loops, the outer for i from 1 to n and the inner for j from 1 to i. Since i is at most n, this part ends up being O(n²). In Part 2 when we extract the words back out, this is a simple loop from 0 to n which ends up being O(n). Part 1 dominates and the overall complexity is O(n²).