
How To Split A Text File Into Smaller Files Based On Regex Pattern?

I have a file like the following: SCN DD1251 UPSTREAM DOWNSTREAM FILTER NODE LINK NODE LINK

Solution 1:

You don't need to use a regex to do this because you can detect the gap between blocks easily by using the string strip() method.

input_file = 'Clean-Junction-Links1.txt'

with open(input_file, 'r') as file:
    i = 0
    output = None

    for line in file:
        if not line.strip():  # Blank line?
            if output:
                output.close()
            output = None
        else:
            if output is None:
                i += 1
                print(f'Creating file "{i}.txt"')
                output = open(f'{i}.txt','w')
            output.write(line)

    if output:
        output.close()

print('-fini-')

Another, cleaner and more modular, way to implement it would be to divide the processing up into two independent tasks that logically have very little to do with each other:

  1. Reading the file and grouping the lines of each record together.
  2. Writing each group of lines to a separate file.

The first can be implemented as a generator function which iteratively collects and yields groups of lines comprising a record. It's the one named extract_records() below.

input_file = 'Clean-Junction-Links1.txt'

def extract_records(filename):
    with open(filename, 'r') as file:
        lines = []
        for line in file:
            if line.strip():  # Not blank?
                lines.append(line)
            elif lines:  # Blank line ends the current record, if any
                yield lines
                lines = []
        if lines:
            yield lines

for i, record in enumerate(extract_records(input_file), start=1):
    print(f'Creating file {i}.txt')
    with open(f'{i}.txt', 'w') as output:
        output.write(''.join(record))

print('-fini-')
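
The grouping step can also be expressed with itertools.groupby from the standard library. This is only a sketch of an alternative, not part of the answer above: group_records() is a hypothetical helper name, and the sample lines are made up for illustration.

```python
from itertools import groupby

def group_records(lines):
    """Yield each run of consecutive non-blank lines as one list."""
    # groupby() batches adjacent lines that share the same key; the key is
    # True for lines with visible content and False for blank ones.
    for has_content, group in groupby(lines, key=lambda line: bool(line.strip())):
        if has_content:
            yield list(group)

# Made-up sample: two records separated by a line of spaces and an empty line.
sample = ["SCN A\n", "NODE 1\n", " " * 80 + "\n", "\n", "SCN B\n"]
records = list(group_records(sample))
print(records)
```

Because runs of blank lines collapse into a single group that is skipped, consecutive separators never produce an empty record. An open file object can be passed in place of the list.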


Solution 2:

You are getting blank output because you are checking whether a line matches a bunch of whitespace (\s{81}\n) and if there is a match, you are writing only that (blank) line. You need to instead print each line as it is read, and then jump to a new file when your pattern matches.

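Also, when you use for line in f, each line keeps its trailing \n, so a separator line of 80 spaces is only 81 whitespace characters long. \s{81} already consumes that newline, leaving nothing for the extra \n in your pattern, so your regex will not match.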

import re

delimiter_pattern = re.compile(r"\s{81}")

with open("Junctions.txt", "r") as f:
    fileNum = 1
    output = open(f'{fileNum}.txt','w') # f-strings require Python 3.6 but are cleaner
    for line in f:
        if not delimiter_pattern.match(line):
            output.write(line)
        else:
            output.close()
            fileNum += 1
            output = open(f'{fileNum}.txt','w')

    # Close last file
    if not output.closed:
        output.close()
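
The behaviour of the two patterns can be checked directly on a single line as it would be read from the file. This is a small check, assuming the separator is a line of exactly 80 spaces; the variable names are only illustrative.

```python
import re

separator_line = " " * 80 + "\n"  # assumed separator: 80 spaces plus its newline

# 80 spaces + 1 newline = 81 whitespace characters, so this matches.
matches_81 = re.match(r"\s{81}", separator_line) is not None
# This needs 81 whitespace characters and then one more newline: no match.
matches_81_nl = re.match(r"\s{81}\n", separator_line) is not None

print(len(separator_line), matches_81, matches_81_nl)  # prints: 81 True False
```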

Solution 3:

A few things.

  1. Only a single text file is produced because you never open a file for writing inside the loop; you open just one before the loop begins.

  2. Based on your desired output, you do not want to match the regular expression on each line, but rather you want to continue reading the file until you obtain a single record.

I have put together a working solution:

with open("Junctions.txt", "r") as f:
    # Read the whole file and split on 80 spaces followed by a newline
    file = f.read()
    sep = " " * 80 + "\n"
    chunks = file.split(sep)

    # Write each chunk of the file to its own .txt file
    i = 0
    for chunk in chunks:
        with open('%d.txt' % i, 'w') as outFile:
            outFile.write(chunk)
        i += 1

This reads the whole file and builds a list of all the groups you want by splitting on the one separator (80 spaces followed by a newline).
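
Since the question asks about a regex, the same split can be sketched with re.split. This is an illustration rather than part of the answer above: it assumes any line holding only spaces or tabs counts as a separator, split_records() is a hypothetical helper name, and the sample text is made up.

```python
import re

def split_records(text):
    """Split text on lines that hold nothing but spaces or tabs."""
    # With re.MULTILINE, ^ anchors at the start of every line, so the
    # pattern consumes one whole whitespace-only line, newline included.
    chunks = re.split(r"^[ \t]*\n", text, flags=re.MULTILINE)
    return [chunk for chunk in chunks if chunk]  # drop empty pieces

# Made-up sample: two records separated by a line of 80 spaces.
sample = "SCN A\nNODE 1\n" + " " * 80 + "\n" + "SCN B\nNODE 2\n"
chunks = split_records(sample)
print(chunks)
```

Filtering out empty pieces avoids writing an empty file when the input starts or ends with a separator line, or contains several in a row.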


Solution 4:

\s matches both spaces and the newline, so the 80 spaces plus one newline account for the {81}. You can't get a second newline when iterating line by line with for line in f, unless you put in extra logic to account for that. Also, match() returns None, not False, when there is no match.

#! /usr/bin/env python3
import re

delimiter_pattern = re.compile(r'\s{81}')

with open('Junctions.txt', 'r') as f:
    i = 1
    for line in f:
        if delimiter_pattern.match(line) is None:
            # Append to the current output file and close it right away
            with open(f'{i}.txt', 'a') as output:
                output.write(line)
        else:
            i += 1  # Separator line: move on to the next file
