This is interesting! I once trained a T5 model by removing newlines from Wikipedia text, and it worked surprisingly well; at the time, the context length was the biggest issue.
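Roughly, the data prep amounts to stripping the structure and training the model to restore it. A minimal sketch of what those (input, target) pairs could look like (the helper name and details are mine, not the original code):

    import re

    def make_pair(article_text: str):
        # Input: the article with newlines (and surrounding whitespace)
        # collapsed to single spaces, i.e. the "unformatted" view the
        # model would see at inference time.
        source = re.sub(r"\s*\n\s*", " ", article_text).strip()
        # Target: the original text with its newline structure intact.
        return source, article_text

    src, tgt = make_pair("Heading\n\nFirst paragraph.\nSecond line.")
    # src == "Heading First paragraph. Second line."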
Another issue, not so easy to solve, was conversational dialogue-type data, which wasn't well represented in the training data.
I’ve always wanted to come back to this problem, because I think it’s very interesting, and we’re going to have a lot of unstructured text coming out of STT models like Whisper, which do a great job of transcribing/translating but generally don’t format anything.