Find Duplicate Value From Two Files Using Python 3

Most programmers say that Python is an easy language to learn.

I agree with that point. Python has been a charm to learn and code in.

With its rich library, awesome documentation, active development and huge community, Python is even easier to learn. In this blog post, I want to show how easy it is to find duplicate values from two files using Python 3.

You don’t even need any external libraries for this, and it takes fewer than 30 lines of code. Also, check out my other Python tutorials.

Let’s Go!!

You can find the full code at the end of the post.

Reading The First File

For this tutorial, we will assume that the first text file is your main file. It contains all the data, and it is the file you will search for the values you want to check.

In the code, this file is referred to as “first_file.txt”, and it has one entry per line.

# Read First File
print('Reading First File \n')

with open('first_file.txt', 'r') as file_one:
    data_one = file_one.readlines()

We use the “with” statement to open the file, because it closes the file automatically when the block of code finishes executing.

In more complex code this approach is safer, because you might otherwise forget to close the file.

Also, note that we open the file in read-only mode with “r”. This means we cannot write to the file, which protects it from accidental changes. You can read more about the modes in the Python documentation.
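To make both points concrete, here is a minimal sketch. It uses a throwaway file called “demo.txt” in a temporary directory (a name chosen only for this illustration, not part of the tutorial’s files) and checks that the file is closed once the “with” block ends and that writing in “r” mode is refused.

```python
import io
import os
import tempfile

# Work in a temporary directory so the example does not touch first_file.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical demo file

    # Create the file with one entry.
    with open(path, 'w') as demo:
        demo.write('apple\n')

    # Re-open it read-only, as the tutorial does.
    with open(path, 'r') as demo:
        closed_inside = demo.closed  # still open inside the block
        try:
            demo.write('oops')
            write_rejected = False
        except io.UnsupportedOperation:
            # Files opened with 'r' refuse write calls.
            write_rejected = True

    closed_outside = demo.closed  # closed automatically after the block

print(closed_inside, closed_outside, write_rejected)  # False True True
```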

 

Next, we read the file with the “readlines” method. It returns every line of the file as an item in a list, which we store in the variable “data_one”. Each item keeps the newline character from the end of its line.

With the data in a list, it will be easier for us to iterate over it and compare the entries.
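As a quick illustration of what “readlines” returns, the sketch below writes a three-line sample file and reads it back. The helper “read_entries” and the file name “sample.txt” are made up for this example; the tutorial itself keeps the raw lines and strips them later, during the comparison.

```python
import os
import tempfile


def read_entries(path):
    """Return the lines of a text file with surrounding whitespace removed."""
    with open(path, 'r') as handle:
        return [line.strip() for line in handle.readlines()]


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.txt')  # hypothetical sample file

    # Note: no newline after the last entry, which is common in real files.
    with open(sample_path, 'w') as sample:
        sample.write('apple\nbanana\ncherry')

    with open(sample_path, 'r') as sample:
        raw_lines = sample.readlines()

    clean_lines = read_entries(sample_path)

print(raw_lines)    # ['apple\n', 'banana\n', 'cherry']
print(clean_lines)  # ['apple', 'banana', 'cherry']
```

The raw list shows why the comparison later on strips each entry: the last line has no newline while the others do.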

Reading The Second File

Just as we read the first file above, we now read the second file.

# Read Second File
print('Reading Second File \n')

with open('second_file.txt', 'r') as file_two:
    data_two = file_two.readlines()

Find The Duplicate Values From Two Files

We have read both files, and their data is stored as lists in the two variables “data_one” and “data_two”.

As mentioned above, storing the data in lists makes it easy to iterate over. Let’s see that in code.

# Compare the data of two files
for file_one_data in data_one:
    for file_two_data in data_two:
        # Check if data match
        if file_one_data.strip() == file_two_data.strip():
            # Display the duplicate entry
            print(file_one_data.strip())

Loops! We use for loops to go through each entry in the lists. First, we take an entry from the first file’s list. Then we compare that entry with every entry from the second file.

Note that for each entry in the first list, the inner loop goes through all the entries in the second list.

Inside the second loop, we use a conditional statement to check whether the two entries match. If they do, we have found a duplicate: a value that appears in both files.
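To see the nested loops at work without any files, here is a small sketch that runs the same comparison on two in-memory lists. The sample entries and the “comparisons” counter are additions for illustration only; the counter shows that every entry of the first list is checked against every entry of the second.

```python
# Sample data standing in for the contents of the two files.
data_one = ['apple\n', 'banana\n', 'cherry\n']
data_two = ['banana\n', 'dragonfruit\n', 'apple']

matches = []
comparisons = 0  # counts how many pairs the loops look at

for file_one_data in data_one:
    for file_two_data in data_two:
        comparisons += 1
        # Check if data match
        if file_one_data.strip() == file_two_data.strip():
            matches.append(file_one_data.strip())

print(matches)      # ['apple', 'banana']
print(comparisons)  # 9, i.e. 3 entries x 3 entries
```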

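As you can see in the code, we use the “strip()” method of the string. It removes any spaces, tabs or newlines from the start and end of the data.

You can skip it only if you are sure that the lines carry no stray whitespace and that both files end their lines the same way. For example, “test ” is not equal to “test”; notice the extra space at the end of the former. Because “readlines” keeps the newline at the end of each line, a final line without one would also fail to match.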
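The effect of “strip()” on the comparison can be checked directly. This short sketch uses two made-up strings, one with a trailing space and newline and one without, and also shows that whitespace inside an entry is left alone.

```python
entry_from_file_one = 'test \n'  # trailing space and newline
entry_from_file_two = 'test'     # e.g. the last line of a file, no newline

# Without stripping, the two entries are different strings.
raw_match = entry_from_file_one == entry_from_file_two

# After stripping, only the text itself is compared.
stripped_match = entry_from_file_one.strip() == entry_from_file_two.strip()

# strip() only touches the two ends; the space between the words stays.
inner = ' two words '.strip()

print(raw_match, stripped_match, repr(inner))  # False True 'two words'
```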

Writing Duplicate Value To File

We have learned to get the duplicate data. Now let us write it to a third file.

# Write similar data to a file
print('Write Similar Data to File \n')

with open('output_file.txt', 'w') as file_out:
    # Compare the data of two files
    for file_one_data in data_one:
        for file_two_data in data_two:
            # Check if data match
            if file_one_data.strip() == file_two_data.strip():
                # Write the duplicate data to file
                file_out.write(file_two_data.strip() + '\n')

Again, we use the “with” statement to open the file. As you can see, we open the file in “w” mode, which means we write to the file.

In this mode, the file is created if it does not exist, and it is overwritten every time you run the code.

Check the Python documentation for the other modes.

Then we put the loops for finding duplicate data inside the “with” block. This way we can write the output to the file directly.

Alternatively, you can save the output in a list and then write it to the file. The choice is yours.

In the last line of the code, we use the “write” method to write to the file.

Also, you can see we are appending a newline to the data. This is done so that each duplicate value is written on a new line.
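The list-based alternative mentioned above could look roughly like this. It is only a sketch: the function names “find_matches” and “write_matches” are invented for this example, and the output goes to a temporary directory so that running it does not overwrite a real “output_file.txt”.

```python
import os
import tempfile


def find_matches(data_one, data_two):
    """Collect every entry of data_two that also appears in data_one."""
    matches = []
    for file_one_data in data_one:
        for file_two_data in data_two:
            # Check if data match
            if file_one_data.strip() == file_two_data.strip():
                matches.append(file_two_data.strip())
    return matches


def write_matches(matches, path):
    """Write one match per line, overwriting any existing file at path."""
    with open(path, 'w') as file_out:
        for match in matches:
            file_out.write(match + '\n')


with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'output_file.txt')
    matches = find_matches(['a\n', 'b\n', 'c\n'], ['c\n', 'a'])
    write_matches(matches, out_path)

    # Read the file back to confirm what was written.
    with open(out_path, 'r') as check:
        written = check.read()

print(matches)        # ['a', 'c']
print(repr(written))  # 'a\nc\n'
```

Keeping the matches in a list first makes it easy to count, sort or print them before anything is written.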

Complete Code

Here is the complete code, showing how your finished script will look. Save the Python file and the data files in the same directory.

#!/usr/bin/env python3

print('== START OF PROGRAM == \n')

# Read First File
print('Reading First File \n')

with open('first_file.txt', 'r') as file_one:
    data_one = file_one.readlines()

# Read Second File
print('Reading Second File \n')

with open('second_file.txt', 'r') as file_two:
    data_two = file_two.readlines()

# Write similar data to a file
print('Write Similar Data to File \n')

with open('output_file.txt', 'w') as file_out:
    # Compare the data of two files
    for file_one_data in data_one:
        for file_two_data in data_two:
            # Check if data match
            if file_one_data.strip() == file_two_data.strip():
                # Write the duplicate data to file
                file_out.write(file_two_data.strip() + '\n')

print('== END OF PROGRAM ==')

Conclusion

The Python language is comparatively easy to use. With only a few lines of code, you can find duplicate values from two files using Python 3.

We achieved it with fewer than 30 lines of code and no external Python library.

This method might not be the best choice if the two files you are comparing hold millions of lines, because the nested loops compare every line of the first file with every line of the second.

But it is good enough for files with up to tens of thousands of lines.

I tried it with around a hundred thousand entries in two files, and it ran in 5 to 10 minutes on a standard PC.

Is there a better and faster way to process the data? Let me know in the comments below.
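One candidate to start that discussion, offered as an untimed sketch and not as a benchmarked result: put the stripped entries of the second file into a Python set, so each entry of the first file needs a single membership check instead of a full inner loop. The function name “find_common_entries” is made up for this example. Note that it behaves a little differently from the code above, because it reports each common value once, while the nested loops write one line for every matching pair.

```python
def find_common_entries(lines_one, lines_two):
    """Return entries present in both lists, in first-list order, once each."""
    # Build the lookup set once from the second file's lines.
    lookup = {line.strip() for line in lines_two}

    seen = set()  # entries already reported
    common = []
    for line in lines_one:
        entry = line.strip()
        # Set membership is checked without scanning the whole second list.
        if entry in lookup and entry not in seen:
            seen.add(entry)
            common.append(entry)
    return common


result = find_common_entries(
    ['apple\n', 'banana\n', 'apple\n', 'cherry\n'],
    ['cherry\n', 'apple\n', 'kiwi\n'],
)
print(result)  # ['apple', 'cherry']
```

To use it with the files from this post, you could pass “data_one” and “data_two” straight in and write the returned list to “output_file.txt”.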

 
