Most programmers say that Python is an easy language to learn, and I agree. Python has been a charm to learn and code in. Its rich library, awesome documentation, active development and huge community make it even easier to learn.
In this blog post, I want to show how easy it is to find duplicate values across two files using Python 3. You don’t need any external libraries for this, and it takes fewer than 30 lines of code. Also, check my other Python tutorials.
Let’s Go!!
You can find the full code at the end of the post.
Reading The First File
For this tutorial, we will assume that the first text file is your main file. It contains all the data, and it is the file you will check for values that also appear in the second file.
In the code, this file is referred to as “first_file.txt”, and it holds one value per line.
# Read First File
print('Reading First File \n')
with open('first_file.txt', 'r') as file_one:
    data_one = file_one.readlines()
We use the “with” statement to open the file because it closes the file automatically when the block of code finishes executing. If you are writing more complex code, this approach is safer because you cannot forget to close the file.
Also, note that we open the file as read-only with the “r” mode. You can read more about the modes in the Python documentation. Opening the file this way means we cannot write to it by accident.
Next, we read the file with the “readlines” method. It returns a list in which each element is one line of the file, and we store that list in the variable “data_one”. With the data in a list, it is easier for us to iterate over it and compare the values.
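To make the shape of that list concrete, here is a small sketch. The sample values and the temporary file are assumptions made up for this demo; they are not part of the tutorial's data. It shows that “readlines” keeps the newline at the end of each line.

```python
import os
import tempfile

# Hypothetical sample data, written to a temporary file just for this demo.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'first_file.txt')
    with open(sample_path, 'w') as sample_file:
        sample_file.write('apple\nbanana\ncherry\n')

    # Same pattern as in the tutorial: read every line into a list.
    with open(sample_path, 'r') as file_one:
        data_one = file_one.readlines()

# Each element is one line, with its trailing newline still attached.
print(data_one)  # ['apple\n', 'banana\n', 'cherry\n']
```

Those trailing newlines are the reason the values get cleaned up before they are compared later on.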
Reading The Second File
Similar to the first file we read above, we will now read the second file.
# Read Second File
print('Reading Second File \n')
with open('second_file.txt', 'r') as file_two:
    data_two = file_two.readlines()
Find The Duplicate Values From Two Files
We have read both files, and their data is stored as lists in the two variables “data_one” and “data_two”.
As mentioned above, storing the data as lists makes it easy for us to iterate over it. Let’s see that in code.
# Compare the data of two files
for file_one_data in data_one:
    for file_two_data in data_two:
        # Check if data match
        if file_one_data.strip() == file_two_data.strip():
            # Display duplicate entry here
            print(file_one_data.strip())
Loops! We use for loops to go through each item in the lists. First, we take one value from the first file’s list. Then we compare that value with every value from file two. Note that for each value in the first list, the inner loop goes through all the values in the second list.
Inside the second loop, we use a conditional statement to check whether the two values match. If they match, we have found a duplicate: a value that appears in both files.
If you look at the code, we are using the “strip()” method of the string. It removes any spaces, tabs or newlines at the start and end of the value. For example, “test “ is not equal to “test”; see the extra space at the end of the former? You can skip this step only if you are sure there is no extra whitespace. Keep in mind that “readlines” leaves the newline on each line, and the last line of a file may not have one, so stripping is the safer choice.
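The effect of “strip()” is easy to check on its own. This is a minimal sketch with made-up strings, not data from the tutorial's files.

```python
# A value as it might come out of readlines(): trailing space and newline.
padded = 'test \n'
clean = 'test'

# Without strip() the two strings differ, so no duplicate would be found.
print(padded == clean)          # False

# After strip() the surrounding whitespace is gone and the values match.
print(padded.strip() == clean)  # True

# strip() only trims both ends; whitespace inside the value is kept.
inner = ' two  words\t'.strip()
print(repr(inner))              # 'two  words'
```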
Writing Duplicate Values To File
We have learned how to find the duplicate data. Now let us write it to a third file.
# Write similar data to a file
print('Write Similar Data to File \n')
with open('output_file.txt', 'w') as file_out:
    # Compare the data of two files
    for file_one_data in data_one:
        for file_two_data in data_two:
            # Check if data match
            if file_one_data.strip() == file_two_data.strip():
                # Write the duplicate data to file
                file_out.write(file_two_data.strip() + '\n')
Again, we use the “with” statement to open the file. As you can see, we open the file in “w” mode, which means write to the file. In this mode, the file is created if it does not exist and is overwritten every time you run the code. Check the Python documentation for the other modes.
Then, we put the loop for finding duplicate data inside the “with” block. This way we can write the output to the file directly. Alternatively, you can save the output in a list and then write it to the file. The choice is yours.
In the last line of the code, we use the “write” method to write to the file. Also, you can see that we append a newline (“\n”) to the data, so that each duplicate value is written on its own line.
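The list-based alternative mentioned above could look roughly like this. It is only a sketch: the two small lists stand in for the results of “readlines”, the name “duplicates” is my own, and the output goes to a temporary directory so the demo does not touch your real files.

```python
import os
import tempfile

# Hypothetical stand-ins for the lists returned by readlines().
data_one = ['apple\n', 'banana\n', 'cherry\n']
data_two = ['banana\n', 'durian\n', 'apple \n']

# Step 1: collect the matches in a list instead of writing them right away.
duplicates = []
for file_one_data in data_one:
    for file_two_data in data_two:
        if file_one_data.strip() == file_two_data.strip():
            duplicates.append(file_two_data.strip())

# Step 2: write the collected values, one per line.
with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = os.path.join(tmp_dir, 'output_file.txt')
    with open(output_path, 'w') as file_out:
        for value in duplicates:
            file_out.write(value + '\n')

    # Read the file back only to show what was written.
    with open(output_path, 'r') as file_check:
        written = file_check.read()

print(duplicates)  # ['apple', 'banana']
```

Keeping the matches in a list first is handy if you want to sort or count them before saving. Writing directly, as in the tutorial code, avoids holding the results in memory.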
Complete Code
Here is the complete code and how your file will look. Save the Python file and the data files in the same directory.
#!/usr/bin/python3
print('== START OF PROGRAM == \n')

# Read First File
print('Reading First File \n')
with open('first_file.txt', 'r') as file_one:
    data_one = file_one.readlines()

# Read Second File
print('Reading Second File \n')
with open('second_file.txt', 'r') as file_two:
    data_two = file_two.readlines()

# Write similar data to a file
print('Write Similar Data to File \n')
with open('output_file.txt', 'w') as file_out:
    # Compare the data of two files
    for file_one_data in data_one:
        for file_two_data in data_two:
            # Check if data match
            if file_one_data.strip() == file_two_data.strip():
                # Write the duplicate data to file
                file_out.write(file_two_data.strip() + '\n')

print('== END OF PROGRAM ==')
Conclusion
The Python language is comparatively easy to use. With very few lines of code, you can find duplicate values across two files using Python 3. We achieved it with fewer than 30 lines of code and no external Python library.
This method might not be the best choice if you have two files with millions of lines to compare, because every line of the first file is compared against every line of the second. But it is good enough for files with up to tens of thousands of lines. I tried it with around a hundred thousand entries in the two files, and it ran in 5 to 10 minutes on a standard PC.
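One possible speed-up, which is not part of the script above, is to put the stripped lines of the second file into a set, so each lookup no longer needs an inner loop. Below is a rough sketch under that assumption. The function name “find_common_lines” and the sample lists are made up for illustration, and I have not timed it against the files from this post.

```python
def find_common_lines(lines_one, lines_two):
    """Return the stripped lines of lines_one that also appear in lines_two."""
    # Build the set once; membership checks on a set are fast on average.
    seen_in_two = {line.strip() for line in lines_two}
    # Keep the order of the first file.
    return [line.strip() for line in lines_one if line.strip() in seen_in_two]


# Hypothetical stand-ins for the lists returned by readlines().
data_one = ['apple\n', 'banana\n', 'cherry\n']
data_two = ['banana\n', 'durian\n', 'apple \n']

common = find_common_lines(data_one, data_two)
print(common)  # ['apple', 'banana']
```

One difference to be aware of: the nested loops write a value once for every matching pair, so a value repeated in the second file is written several times. This version reports each line of the first file at most once.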
Is there a better and faster way to process the data? Let me know in the comments below.