Read File and Put Line as List in Dictionary
How to extract specific portions of a text file using Python
Updated: 06/30/2020 by Computer Hope
Extracting text from a file is a common task in scripting and programming, and Python makes it easy. In this guide, we'll discuss some simple ways to extract text from a file using the Python 3 programming language.
Make sure you're using Python 3
In this guide, we'll be using Python version 3. Most systems come pre-installed with Python 2.7. While Python 2.7 is used in legacy code, Python 3 is the present and future of the Python language. Unless you have a specific reason to write or support Python 2, we recommend working in Python 3.
For Microsoft Windows, Python 3 can be downloaded from the official Python website. When installing, make sure the "Install launcher for all users" and "Add Python to PATH" options are both checked.
On Linux, you can install Python 3 with your package manager. For instance, on Debian or Ubuntu, you can install it with the following command:
sudo apt-get update && sudo apt-get install python3
For macOS, the Python 3 installer can be downloaded from python.org. If you are using the Homebrew package manager, it can also be installed by opening a terminal window (Applications → Utilities), and running this command:
brew install python3
Running Python
On Linux and macOS, the command to run the Python 3 interpreter is python3. On Windows, if you installed the launcher, the command is py. The commands on this page use python3; if you're on Windows, substitute py for python3 in all commands.
Running Python with no options starts the interactive interpreter. For more information about using the interpreter, see Python overview: using the Python interpreter. If you accidentally enter the interpreter, you can exit it using the command exit() or quit().
Running Python with a file name will interpret that Python program. For instance:
python3 program.py
...runs the program contained in the file program.py.
Okay, how can we use Python to extract text from a text file?
Reading data from a text file
First, let's read a text file. Let's say we're working with a file named lorem.txt, which contains lines from the Lorem Ipsum example text.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
Note
In all the examples that follow, we work with the four lines of text contained in this file. Copy and paste the Latin text above into a text file, and save it as lorem.txt, so you can run the example code using this file as input.
A Python program can read a text file using the built-in open() function. For instance, the Python 3 program below opens lorem.txt for reading in text mode, reads the contents into a string variable named contents, closes the file, and prints the data.
myfile = open("lorem.txt", "rt")  # open lorem.txt for reading text
contents = myfile.read()          # read the entire file to string
myfile.close()                    # close the file
print(contents)                   # print string contents
Here, myfile is the name we give to our file object.
The "rt" parameter in the open() function means "we're opening this file to read text data."
The hash mark ("#") means that everything after it on that line is a comment, and it's ignored by the Python interpreter.
If you save this program in a file called read.py, you can run it with the following command.
python3 read.py
The command above outputs the contents of lorem.txt:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
Using "with open"
It's important to close your open files as soon as possible: open the file, perform your operation, and close it. Don't leave it open for extended periods of time.
When you're working with files, it's good practice to use the with open...as compound statement. It's the cleanest way to open a file, operate on it, and close the file, all in one easy-to-read block of code. The file is automatically closed when the code block completes.
Using with open...as, we can rewrite our program to look like this:
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text
    contents = myfile.read()             # Read the entire file to a string
print(contents)                          # Print the string
Note
Indentation is important in Python. Python programs use white space at the beginning of a line to define scope, such as a block of code. We recommend you use four spaces per level of indentation, and that you use spaces rather than tabs. In the following examples, make sure your code is indented exactly as it's presented here.
Example
Save the program as read.py and execute it:
python3 read.py
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
Reading text files line-by-line
In the examples so far, we've been reading in the whole file at once. Reading a full file is no big deal with small files, but generally speaking, it's not a great idea. For one thing, if your file is bigger than the amount of available memory, you'll encounter an error.
In almost every case, it's a better idea to read a text file one line at a time.
In Python, the file object is an iterator. An iterator is a type of Python object which behaves in certain ways when operated on repeatedly. For instance, you can use a for loop to operate on a file object repeatedly, and each time the same operation is performed, you'll receive a different, or "next," result.
Example
For text files, the file object iterates one line of text at a time. It considers one line of text a "unit" of data, so we can use a for...in loop statement to iterate one line at a time:
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading
    for myline in myfile:                # For each line, read to a string,
        print(myline)                    # and print the string.
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Nunc fringilla arcu congue metus aliquam mollis.

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

Quisque at dignissim lacus.

Notice that we're getting an extra line break ("newline") after every line. That's because two newlines are being printed. The first one is the newline at the end of every line of our text file. The second newline happens because, by default, print() adds a linebreak of its own at the end of whatever you've asked it to print.
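To see the newline that each line carries, it can help to print the line's repr() instead of the line itself. The sketch below is only an illustration: it uses io.StringIO with two made-up lines as a stand-in for lorem.txt, so it runs without any file on disk.

```python
import io

# Stand-in for lorem.txt (an assumption for this sketch), so no file is needed.
sample = io.StringIO("first line\nsecond line\n")

lines_seen = []
for myline in sample:
    lines_seen.append(myline)
    print(repr(myline))   # repr() makes the trailing '\n' visible
```

Each string read from the file-like object still ends in '\n', which is the first of the two newlines described above.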
Let's store our lines of text in a variable (specifically, a list variable) so we can look at it more closely.
Storing text data in a variable
In Python, lists are similar to, but not the same as, an array in C or Java. A Python list contains indexed data, of varying lengths and types.
Example
mylines = []                               # Declare an empty list named mylines.
with open('lorem.txt', 'rt') as myfile:    # Open lorem.txt for reading text data.
    for myline in myfile:                  # For each line, stored as myline,
        mylines.append(myline)             # add its contents to mylines.
print(mylines)                             # Print the list.
The output of this program is a little different. Instead of printing the contents of the list, this program prints our list object, which looks like this:
Output:
['Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n', 'Nunc fringilla arcu congue metus aliquam mollis.\n', 'Mauris nec maximus purus. Maecenas sit amet pretium tellus.\n', 'Quisque at dignissim lacus.\n']
Here, we see the raw contents of the list. In its raw object form, a list is represented as a comma-delimited list. Here, each element is represented as a string, and each newline is represented as its escape character sequence, \n.
Much like a C or Java array, the list elements are accessed by specifying an index number after the variable name, in brackets. Index numbers start at zero; in other words, the nth element of a list has the numeric index n-1.
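As a quick illustration of zero-based indexing, here is a minimal sketch using a small hand-written list (the four strings are placeholders, not the contents of lorem.txt). It also shows negative indexes, a Python feature that counts from the end of the list.

```python
mylines = ["line one", "line two", "line three", "line four"]

first = mylines[0]                  # index 0 is the first element
last = mylines[len(mylines) - 1]    # the nth element has index n-1
also_last = mylines[-1]             # negative indexes count from the end

print(first, last, also_last)
```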
Note
If you're wondering why the index numbers start at zero instead of one, you're not alone. Computer scientists have debated the usefulness of zero-based numbering systems in the past. In 1982, Edsger Dijkstra gave his opinion on the subject, explaining why zero-based numbering is the best way to index data in computer science. You can read the memo yourself; he makes a compelling argument.
Example
We can print the first element of mylines by specifying index number 0, contained in brackets after the name of the list:
print(mylines[0])
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Example
Or the third line, by specifying index number 2:
print(mylines[2])
Output:
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
But if we try to access an index for which there is no value, we get an error. Our list has four elements, so the last valid index is 3:
Example
print(mylines[4])
Output:
Traceback (most recent call last):
  File <filename>, line <linenum>, in <module>
    print(mylines[4])
IndexError: list index out of range
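If a program cannot know in advance how many lines a file has, one way to avoid the traceback is to catch the IndexError. The helper below, get_line, is a hypothetical name introduced only for this sketch; it is not part of Python or of the programs in this guide.

```python
def get_line(lines, index):
    """Return lines[index], or None if the index is out of range.

    Hypothetical helper for illustration only.
    """
    try:
        return lines[index]
    except IndexError:
        return None


sample_lines = ["first", "second"]
print(get_line(sample_lines, 1))   # a valid index
print(get_line(sample_lines, 2))   # out of range, so None
```

Checking index < len(lines) before indexing would work as well; which style reads better depends on the surrounding code.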
Example
A list object is iterable, so to print every element of the list, we can iterate over it with for...in:
mylines = []                               # Declare an empty list
with open('lorem.txt', 'rt') as myfile:    # Open lorem.txt for reading text.
    for line in myfile:                    # For each line of text,
        mylines.append(line)               # add that line to the list.
for element in mylines:                    # For each element in the list,
    print(element)                         # print it.
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Nunc fringilla arcu congue metus aliquam mollis.

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

Quisque at dignissim lacus.

But we're still getting extra newlines. Each line of our text file ends in a newline character ('\n'), which is being printed. Also, after printing each line, print() adds a newline of its own, unless you tell it to do otherwise.
We can change this default behavior by specifying an end parameter in our print() call:
print(element, end='')
By setting end to an empty string (two single quotes, with no space), we tell print() to print nothing at the end of a line, instead of a newline character.
Example
Our revised program looks like this:
mylines = []                               # Declare an empty list
with open('lorem.txt', 'rt') as myfile:    # Open file lorem.txt
    for line in myfile:                    # For each line of text,
        mylines.append(line)               # add that line to the list.
for element in mylines:                    # For each element in the list,
    print(element, end='')                 # print it without extra newlines.
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
The newlines you see here are actually in the file; they're a special character ('\n') at the end of each line. We want to get rid of these, so we don't have to worry about them while we process the file.
How to strip newlines
To remove the newlines completely, we can strip them. To strip a string is to remove one or more characters, usually whitespace, from either the beginning or end of the string.
Tip
This process is sometimes also called "trimming."
Python 3 string objects have a method called rstrip(), which strips characters from the right side of a string. The English language reads left-to-right, so stripping from the right side removes characters from the end.
If the variable is named mystring, we can strip its right side with mystring.rstrip(chars), where chars is a string of characters to strip. For instance, "123abc".rstrip("bc") returns 123a.
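One detail that is easy to miss: the chars argument is treated as a set of characters, not as a suffix. rstrip() keeps removing characters from the right for as long as they appear in chars. The short sketch below illustrates this with a few throwaway strings; the variable names are only for this example.

```python
stripped_set = "123abc".rstrip("bc")       # strips any trailing 'b' or 'c'
stripped_order = "123abc".rstrip("cb")     # same result: order in chars is irrelevant
stripped_run = "abcbcb".rstrip("bc")       # keeps stripping until it reaches 'a'
stripped_newline = "line\n".rstrip("\n")   # removes only the trailing newline
stripped_default = "line \t\n".rstrip()    # no argument: strips all trailing whitespace

print(stripped_set, stripped_order, stripped_run, stripped_newline, stripped_default)
```

Passing '\n' explicitly, as the programs in this guide do, leaves trailing spaces and tabs in place, which matters if that whitespace is meaningful in your data.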
Tip
When you represent a string in your program with its literal contents, it's called a string literal. In Python (as in most programming languages), string literals are always quoted, enclosed on either side by single (') or double (") quotes. In Python, single and double quotes are equivalent; you can use one or the other, as long as they match on both ends of the string. It's traditional to represent a human-readable string (such as Hello) in double-quotes ("Hello"). If you're representing a single character (such as b), or a single special character such as the newline character (\n), it's traditional to use single quotes ('b', '\n'). For more information about how to use strings in Python, you can read the documentation of strings in Python.
The statement string.rstrip('\n') will strip any newline characters from the right side of string. The following version of our program strips the newlines when each line is read from the text file:
mylines = []                                # Declare an empty list.
with open('lorem.txt', 'rt') as myfile:     # Open lorem.txt for reading text.
    for myline in myfile:                   # For each line in the file,
        mylines.append(myline.rstrip('\n')) # strip newline and add to list.
for element in mylines:                     # For each element in the list,
    print(element)                          # print it.
The text is now stored in a list variable, so individual lines can be accessed by index number. Newlines were stripped, so we don't have to worry about them. We can always put them back later if we reconstruct the file and write it to disk.
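Putting the newlines back can be done with the string method join(). The following is a minimal sketch, again using io.StringIO with made-up contents in place of lorem.txt; it assumes the original text ended with a newline, which is why one is added after the join.

```python
import io

# Stand-in for lorem.txt (an assumption for this sketch).
source = io.StringIO("alpha\nbeta\ngamma\n")

# Strip the newline from each line while building the list.
mylines = [myline.rstrip('\n') for myline in source]

# Rebuild the text: join the lines with newlines and restore the final one.
rebuilt = '\n'.join(mylines) + '\n'

print(mylines)
print(repr(rebuilt))
```

The list comprehension in the middle is a compact equivalent of the for loop with append() used in the program above.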
Now, let's search the lines in the list for a specific substring.
Searching text for a substring
Let's say we want to locate every occurrence of a certain phrase, or even a single letter. For instance, maybe we need to know where every "e" is. We can accomplish this using the string's find() method.
The list stores each line of our text as a string object. All string objects have a method, find(), which locates the first occurrence of a substring in the string.
Let's use the find() method to search for the letter "e" in the first line of our text file, which is stored in the list mylines. The first element of mylines is a string object containing the first line of the text file. This string object has a find() method.
In the parentheses of find(), we specify parameters. The first and only required parameter is the string to search for, "e". The statement mylines[0].find("e") tells the interpreter to search forward, starting at the beginning of the string, one character at a time, until it finds the letter "e." When it finds one, it stops searching, and returns the index number where that "e" is located. If it reaches the end of the string, it returns -1 to indicate nothing was found.
Example
print(mylines[0].find("e"))
Output:
3
The return value "3" tells us that the letter "e" is the fourth character, the "e" in "Lorem". (Remember, the index is zero-based: index 0 is the first character, 1 is the second, etc.)
The find() method takes two optional, additional parameters: a start index and a stop index, indicating where in the string the search should begin and end. The stop index itself is excluded from the search. For instance, string.find("abc", 10, 20) searches for the substring "abc", but only from the 11th to the 20th character (indexes 10 through 19). If stop is not specified, find() starts at index start, and stops at the end of the string.
Example
For instance, the following statement searches for "e" in mylines[0], beginning at the fifth character.
print(mylines[0].find("e", 4))
Output:
24
In other words, starting at the 5th character in mylines[0], the first "e" is located at index 24 (the "e" in "amet").
Example
To start searching at index 10, and stop at index 30, in the third line of the file:
print(mylines[2].find("e", 10, 30))
Output:
28
(The first "e" in "Maecenas").
If find() doesn't locate the substring in the search range, it returns the number -1, indicating failure:
print(mylines[0].find("e", 25, 30))
Output:
-1
There were no "e" occurrences betwixt indices 25 and 30.
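The behavior of the start and stop parameters can be checked on the first line of lorem.txt, written out here as a string literal so the sketch runs without the file. The variable names are only for this example. Note how the "e" at index 24 is missed when stop is 24 and found when stop is 25, because the stop index is not part of the search range.

```python
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."

first_e = text.find("e")            # whole string
second_e = text.find("e", 4)        # from index 4 to the end
excluded = text.find("e", 20, 24)   # indexes 20..23: the "e" at 24 is outside
included = text.find("e", 20, 25)   # indexes 20..24: now index 24 is inside
missing = text.find("e", 25, 30)    # no "e" in indexes 25..29

print(first_e, second_e, excluded, included, missing)
```

The value returned is always an index into the whole string, not an offset from start.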
Finding all occurrences of a substring
But what if we want to locate every occurrence of a substring, not just the first one we encounter? We can iterate over the string, starting from the index of the previous match.
In this example, we'll use a while loop to repeatedly find the letter "e". When an occurrence is found, we call find again, starting from a new location in the string: the location of the last occurrence, plus the length of the substring (so we move past the previous match). When find returns -1, or the start index exceeds the length of the string, we stop.
# Build array of lines from file, strip newlines
mylines = []                                # Declare an empty list.
with open('lorem.txt', 'rt') as myfile:     # Open lorem.txt for reading text.
    for myline in myfile:                   # For each line in the file,
        mylines.append(myline.rstrip('\n')) # strip newline and add to list.

# Locate and print all occurrences of letter "e"
substr = "e"                  # substring to search for.
for line in mylines:          # string to be searched
    index = 0                 # current index: character being compared
    prev = 0                  # previous index: last character compared
    while index < len(line):  # While index has not exceeded string length,
        index = line.find(substr, index)  # set index to first occurrence of "e"
        if index == -1:       # If nothing was found,
            break             # exit the while loop.
        print(" " * (index - prev) + "e", end='')  # print spaces from previous
                                                   # match, then the substring.
        prev = index + len(substr)  # remember this position for next loop.
        index += len(substr)  # increment the index by the length of substr.
                              # (Repeat until index > line length)
    print('\n' + line)        # Print the original string under the e's
Output:
   e                    e       e  e               e
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
                         e  e
Nunc fringilla arcu congue metus aliquam mollis.
        e                   e e          e    e      e
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
      e
Quisque at dignissim lacus.
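If you need the positions themselves rather than a printed diagram, the same loop can collect them in a list. The function below, find_all, is a hypothetical helper written for this sketch (Python strings have no built-in method by that name). Like the program above, it skips ahead by the length of the substring, so it reports non-overlapping matches only.

```python
def find_all(text, substr):
    """Return the index of every non-overlapping occurrence of substr in text.

    Hypothetical helper for illustration; an empty substr is rejected
    because find() would match it at every position forever.
    """
    if not substr:
        raise ValueError("substr must not be empty")
    positions = []
    index = text.find(substr)
    while index != -1:
        positions.append(index)
        index = text.find(substr, index + len(substr))  # continue after this match
    return positions


line = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
print(find_all(line, "e"))
```

For the first line of lorem.txt, the indexes returned correspond to the columns of the e's printed above it in the output.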
Incorporating regular expressions
For complex searches, use regular expressions.
The Python regular expressions module is called re. To use it in your program, import the module before you use it:
import re
The re module implements regular expressions by compiling a search pattern into a pattern object. Methods of this object can then be used to perform match operations.
For example, let's say you want to search for any word in your document which starts with the letter d and ends in the letter r. We can accomplish this using the regular expression "\bd\w*r\b". What does this mean?
character sequence | meaning |
---|---|
\b | A word boundary. It matches the empty string, but only at the beginning or end of a word: between a word character and a non-word character, or at the start or end of the string. "Word characters" are the digits 0 through 9, the lowercase and uppercase letters, or an underscore ("_"). |
d | Lowercase letter d. |
\w* | \w represents any word character, and * is a quantifier meaning "zero or more of the previous character." So \w* will match zero or more word characters. |
r | Lowercase letter r. |
\b | Word boundary. |
So this regular expression will match any string that can be described as "a word boundary, then a lowercase 'd', then zero or more word characters, then a lowercase 'r', then a word boundary." Strings described this way include the words destroyer, dour, and doctor, and the abbreviation dr.
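To check that description against real text, here is a small sketch using re.findall(), which returns every non-overlapping match as a list of strings. The sample sentence is made up for this example.

```python
import re

words = "The doctor gave a dour look at the destroyer, then drank water with dr Smith."

# Find every word that starts with "d" and ends with "r".
matches = re.findall(r"\bd\w*r\b", words)

print(matches)
```

The word "drank" is skipped: it starts with d and contains an r, but the r is not followed by a word boundary.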
To use this regular expression in Python search operations, we first compile it into a pattern object. For instance, the following Python statement creates a pattern object named pattern which we can use to perform searches using that regular expression.
pattern = re.compile(r"\bd\w*r\b")
Note
The letter r before our string in the statement above is important. It tells Python to interpret our string as a raw string, exactly as we've typed it. If we didn't prefix the string with an r, Python would interpret the escape sequences such as \b in other ways. Whenever you need Python to interpret your strings literally, specify it as a raw string by prefixing it with r.
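The difference is easy to demonstrate. In an ordinary string literal, \b is the escape sequence for the backspace character, so the regular expression engine never sees a word-boundary marker. The sketch below compares the two forms; the variable names are only for this example.

```python
import re

plain = "\bdoctor\b"    # "\b" becomes a backspace character (ASCII 8)
raw = r"\bdoctor\b"     # backslash and "b" are kept for the regex engine

text = "Good morning, doctor."
plain_match = re.search(plain, text)   # looks for literal backspaces: no match
raw_match = re.search(raw, text)       # looks for word boundaries: match

print(len(plain), len(raw))
print(plain_match, raw_match.group())
```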
Now we can use the pattern object's methods, such as search(), to search a string for the compiled regular expression, looking for a match. If it finds one, it returns a special result called a match object. Otherwise, it returns None, a built-in Python constant that is used like the boolean value "false".
import re
text = "Good morning, doctor."
pat = re.compile(r"\bd\w*r\b")       # compile regex "\bd\w*r\b" to a pattern object
if pat.search(text) is not None:     # Search for the pattern. If found,
    print("Found it.")
Output:
Found it.
To perform a case-insensitive search, you can specify the special constant re.IGNORECASE in the compile step:
import re
text = "Hello, Doctor."
pat = re.compile(r"\bd\w*r\b", re.IGNORECASE)  # upper and lowercase will match
if pat.search(text) is not None:
    print("Found it.")
Output:
Found it.
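The match object returned by search() also tells you what was matched and where. A brief sketch, using the match object's group() and span() methods on a made-up sentence:

```python
import re

pat = re.compile(r"\bd\w*r\b", re.IGNORECASE)
match = pat.search("Hello, Doctor.")

if match is not None:
    found_text = match.group()   # the text that matched
    found_span = match.span()    # (start, end) indexes; end is exclusive
    print(found_text, found_span)
```

Like the stop parameter of find(), the end index in the span is one past the last matched character.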
Putting it all together
So now we know how to open a file, read the lines into a list, and locate a substring in any given list element. Let's use this knowledge to build some example programs.
Print all lines containing substring
The program below reads a log file line by line. If the line contains the word "error," it is added to a list called errors. If not, it is ignored. The lower() string method converts all strings to lowercase for comparison purposes, making the search case-insensitive without altering the original strings.
Note that the find() method is called directly on the result of the lower() method; this is called method chaining. Also, note that in the errors.append() statement, we construct an output string by joining several strings with the + operator.
errors = []                       # The list where we will store results.
linenum = 0
substr = "error".lower()          # Substring to search for.
with open('logfile.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if line.lower().find(substr) != -1:    # if case-insensitive match,
            errors.append("Line " + str(linenum) + ": " + line.rstrip('\n'))
for err in errors:
    print(err)
Input (stored in logfile.txt):
This is line 1
This is line 2
Line 3 has an error!
This is line 4
Line 5 also has an error!
Output:
Line 3: Line 3 has an error!
Line 5: Line 5 also has an error!
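The same result can be had with two other built-in tools: the in operator, which tests whether one string contains another, and enumerate(), which supplies the line counter. This is an alternative sketch, not the program above; it uses io.StringIO holding the five input lines shown above as a stand-in for logfile.txt.

```python
import io

# Stand-in for logfile.txt (an assumption for this sketch).
logfile = io.StringIO(
    "This is line 1\n"
    "This is line 2\n"
    "Line 3 has an error!\n"
    "This is line 4\n"
    "Line 5 also has an error!\n"
)

substr = "error"
errors = []
for linenum, line in enumerate(logfile, start=1):   # count lines from 1
    if substr in line.lower():                      # case-insensitive containment test
        errors.append("Line " + str(linenum) + ": " + line.rstrip('\n'))

for err in errors:
    print(err)
```

Use find() when you need the position of the match; when you only need to know whether the substring is present, in is usually easier to read.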
Extract all lines containing substring, using regex
The program below is similar to the above program, but using the re regular expressions module. The errors and line numbers are stored as tuples, e.g., (linenum, line). The tuple is created by the additional enclosing parentheses in the errors.append() statement. The elements of the tuple are referenced similar to a list, with a zero-based index in brackets. As constructed here, err[0] is a linenum and err[1] is the associated line containing an error.
import re
errors = []
linenum = 0
pattern = re.compile("error", re.IGNORECASE)  # Compile a case-insensitive regex
with open('logfile.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if pattern.search(line) is not None:  # If a match is found
            errors.append((linenum, line.rstrip('\n')))
for err in errors:                            # Iterate over the list of tuples
    print("Line " + str(err[0]) + ": " + err[1])
Output:
Line 6: Mar 28 09:10:37 Error: cannot contact server. Connection refused.
Line 10: Mar 28 10:28:15 Kernel error: The specified location is not mounted.
Line 14: Mar 28 11:06:30 Error: usb 1-1: can't set config, exiting.
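Tuples can also be unpacked directly in the for statement, which avoids the err[0] and err[1] indexes. The sketch below uses a hand-written list of (linenum, line) tuples; the log messages in it are shortened placeholders, not real program output.

```python
# Hypothetical sample data in the same (linenum, line) shape as above.
errors = [
    (6, "Error: cannot contact server."),
    (10, "Kernel error: location is not mounted."),
]

report = []
for linenum, line in errors:                  # unpack each tuple into two names
    report.append(f"Line {linenum}: {line}")  # f-string instead of + and str()

for entry in report:
    print(entry)
```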
Extract all lines containing a phone number
The program below prints any line of a text file, info.txt, which contains a US or international phone number. It accomplishes this with the regular expression "(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}". This regex matches the following phone number notations:
- 123-456-7890
- (123) 456-7890
- 123 456 7890
- 123.456.7890
- +91 (123) 456-7890
import re
errors = []
linenum = 0
pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")
with open('info.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if pattern.search(line) is not None:  # If pattern search finds a match,
            errors.append((linenum, line.rstrip('\n')))
for err in errors:
    print("Line", str(err[0]), ": " + err[1])
Output:
Line 3 : My phone number is 731.215.8881.
Line 7 : You can reach Mr. Walters at (212) 558-3131.
Line 12 : His agent, Mrs. Kennedy, can be reached at +12 (123) 456-7890
Line 14 : She can also be contacted at (888) 312.8403, extension 12.
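It is worth checking what this pattern actually captures. Because search() looks for the pattern anywhere in the line, and the pattern only requires three digits, an optional separator, and four digits, the match is effectively the last seven digits of each number (with a leading separator). The sketch below runs the pattern against the five notations listed above; the variable names are only for this example.

```python
import re

pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")

samples = [
    "123-456-7890",
    "(123) 456-7890",
    "123 456 7890",
    "123.456.7890",
    "+91 (123) 456-7890",
]

# Record the text matched in each sample.
tails = [pattern.search(sample).group() for sample in samples]

for sample, tail in zip(samples, tails):
    print(repr(sample), "->", repr(tail))
```

This makes the pattern permissive: any line containing seven consecutive digits will also match. That is acceptable for flagging candidate lines, but a stricter pattern would be needed to validate phone numbers.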
Search a dictionary for words
The program below searches the dictionary for any words that start with h and end in pe. For input, it uses a dictionary file included on many Unix systems, /usr/share/dict/words.
import re
filename = "/usr/share/dict/words"
pattern = re.compile(r"\bh\w*pe$", re.IGNORECASE)
with open(filename, "rt") as myfile:
    for line in myfile:
        if pattern.search(line) is not None:
            print(line, end='')
Output:
Hope
heliotrope
hope
hornpipe
horoscope
hype
Source: https://www.computerhope.com/issues/ch001721.htm