Pandas Read Txt File Into Different Columns
Official documentation for pandas file reading
Before using pandas to read files, the essential reference is the official documentation, available at:
http://pandas.pydata.org/pandas-docs/version/0.24/reference/io.html
File reading and writing belong to Input/Output (IO) operations in pandas, and the basic API is covered at the URL above. The core of this article is to walk you through the commonly used commands.
Reading txt files with pandas
Before reading a txt file, check whether it follows a basic format, that is, whether it uses special separators such as '\t', ',', or spaces.
A typical txt file looks something like this.
Example of a txt file (space-separated):
1 2019-03-22 00:06:24.4463094 Chinese test
2 2019-03-22 00:06:32.4565680 Need to edit encoding
3 2019-03-22 00:06:32.6835965 ashshsh
4 2017-03-22 00:06:32.8041945 eggg
The read command can be either read_csv or read_table.
import pandas as pd
df = pd.read_table("./test.txt")
print(df)

import pandas as pd
df = pd.read_csv("./test.txt")
print(df)
Note, however, that the data read here is a DataFrame of 3 rows and 1 column; it is not split into 3 rows and 4 columns as we require.
import pandas as pd
df = pd.read_csv("./test.txt")
print(type(df))
print(df.shape)

<class 'pandas.core.frame.DataFrame'>
(3, 1)
read_csv function
By default, read_csv loads delimited data from a file, URL, or file-like object, and the default separator is a comma.
The txt file above is not comma-separated, so you need to add the sep separator parameter when reading it:
df = pd.read_csv("./test.txt", sep=' ')
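A quick sanity check (a minimal sketch, assuming the space-separated test.txt shown above):

import pandas as pd

df = pd.read_csv("./test.txt", sep=' ')
print(df.shape)   # the lines are now split into 4 columns, e.g. (3, 4)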
Parameter descriptions, official source: https://github.com/pandas-dev/pandas/blob/v0.24.0/pandas/io/parsers.py#L531-L697
Descriptions and key feature cases are given below.
Parameter | Description |
---|---|
filepath_or_buffer | Can be a URL; supported URL types include http, ftp, s3, and local files, e.g. file://localhost/path/to/table.csv |
sep | str, default ','. Specifies the separator. If not given, pandas tries to split on commas. Separators longer than one character (other than '\s+') are interpreted as regular expressions and force the Python parser; note that regex separators are prone to ignoring quoted data. Regular expression example: '\r\t' |
delimiter | Alternative delimiter (if this parameter is specified, sep is ignored). Generally not used |
delim_whitespace | True or False, default False. Use whitespace as the separator, equivalent to sep='\s+'. If this parameter is set, delimiter does not take effect |
header | Row number(s) to use as column names (comment lines are ignored). If no column names are passed, the default is header=0; if column names are passed, set header=None |
names | Specifies the column names. If the file does not contain a header row, explicitly set header=None. header can also be a list of integers, such as [0, 1, 3]; unspecified rows in between are skipped (row 2 in that example) |
index_col (case 1) | Default None. Column(s) to use as the row labels of the DataFrame. If a sequence is given, a MultiIndex is used. If the file has a delimiter at the end of each line, consider passing index_col=False to stop pandas from using the first column as the row labels |
usecols | Default None. A sequence of column indices or names, such as [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter can speed up loading and reduce memory consumption |
squeeze | Default False. If True and the parsed data contains only one column, return a Series |
prefix | Prefix for automatically generated column numbers, e.g. 'X' gives X0, X1, ... Valid when header=None or no header is set |
mangle_dupe_cols | Default True: duplicate columns are renamed 'X.0'...'X.N' rather than all being 'X'. Passing False allows duplicate names in the columns, which can cause data to be overwritten |
dtype | Example: {'a': np.float64, 'b': np.int32}. Specifies the data type for each column; 'a' and 'b' are column names |
engine | The parsing engine to use: 'c' or 'python'. The C engine is faster, but the Python engine has more features |
converters (case 2) | Processing functions for specified columns; columns can be designated by index or by column name |
true_values / false_values | No practical use case found yet; noted here for later improvement |
skipinitialspace | Skip spaces after the delimiter; default False |
skiprows | Default None. Number of rows to skip (from the start of the file), or a list of row numbers to skip (starting at 0) |
skipfooter | Number of rows to skip at the end of the file (not supported by the C engine) |
nrows | Number of rows to read from the file (counting from the start of the file) |
na_values | Defines null values. By default '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', and 'null' are all treated as NaN |
keep_default_na | If na_values is specified and keep_default_na=False, the default NaN values are overridden; otherwise the specified values are added to the defaults |
na_filter | Detect missing values (empty strings or NA values). For large files with no N/A or null values in the dataset, na_filter=False can improve read speed |
verbose | Print extra parser output, for example "Number of missing values in non-numeric columns" |
skip_blank_lines | If True, skip blank lines; otherwise record them as NaN |
parse_dates | Takes several forms: 1. boolean: True -> parse the index; 2. list of ints or names: e.g. [1, 2, 3] -> parse columns 1, 2, and 3 each as a separate date column; 3. list of lists: e.g. [[1, 3]] -> combine columns 1 and 3 into a single date column; 4. dict: e.g. {'foo': [1, 3]} -> combine columns 1 and 3 and name the merged column 'foo' |
infer_datetime_format | If True and parse_dates is enabled, pandas tries to infer the datetime format and, if successful, switches to a faster parsing method. In some cases this is 5 to 10 times faster |
keep_date_col | If multiple columns are joined to parse dates, keep the columns that participated in the join. Default False |
date_parser | Function used to parse dates; by default dateutil.parser.parser is used. Pandas tries three different approaches, advancing to the next if one fails: 1. pass one or more arrays (as specified by parse_dates) as arguments; 2. concatenate the specified multi-column string values into a single column and pass that; 3. call date_parser once per row with one or more strings (specified by parse_dates) as arguments |
dayfirst | Parse dates in DD/MM format |
iterator | Return a TextFileReader object for processing the file chunk by chunk |
chunksize | Size of each file chunk (number of rows) |
compression | Use compressed files on disk directly. With 'infer', files whose names end in '.gz', '.bz2', '.zip', or '.xz' are decompressed with gzip, bz2, zip, or xz respectively; otherwise no decompression is done. If using zip, the ZIP archive must contain exactly one data file. Set to None for no decompression. New in version 0.18.1: supports zip and xz decompression |
thousands | Thousands separator, e.g. ',' |
decimal | Decimal point character, default '.' |
lineterminator | Line terminator, only used with the C parser |
quotechar | The quote character, marking the start and end of a quoted item; separators inside quotes are ignored |
quoting | Controls field quoting per the csv module constants: QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2), or QUOTE_NONE (3) |
doublequote | When quotechar is specified and quoting is not QUOTE_NONE, indicates whether two consecutive quotechar occurrences inside a field should be interpreted as a single quote character |
escapechar | When quoting is QUOTE_NONE, a character used to escape the separator so it is not treated as a delimiter |
comment | Marks lines that should not be parsed. If the character appears at the start of a line, the whole line is ignored. Must be a single character. Like blank lines (when skip_blank_lines=True), comment lines are ignored by header and skiprows. For example, specifying comment='#' and parsing '#empty\na,b,c\n1,2,3' with header=0 yields 'a,b,c' as the header |
encoding | Encoding, specifying the character set type, usually 'utf-8' |
dialect | If not specified, it is ignored when sep is longer than one character. See the csv.Dialect documentation for details |
error_bad_lines | By default, rows with too many fields raise an error and no DataFrame is returned. If set to False, these "bad lines" are dropped from the result instead (only available with the C parser) |
warn_bad_lines | If error_bad_lines=False and warn_bad_lines=True, each "bad line" is reported (only available with the C parser) |
low_memory | Process the file in chunks internally, lowering memory use while parsing, at the risk of mixed type inference. To make sure types are not mixed, set it to False or specify the types with the dtype parameter. Note that the whole file is read into a single DataFrame either way; use the chunksize or iterator parameter to get the data back in chunks (only valid with the C parser) |
delim_whitespace | New in version 0.18.1: supported by the Python parser |
memory_map | If a file path is given for filepath_or_buffer, map the file object directly into memory and access the data from there. This option can improve performance because there is no longer any I/O overhead from repeated file reads |
float_precision | Specifies which converter the C engine uses for floating-point values |
This section is based on the reference blog https://www.cnblogs.com/datablog/p/6127000.html. Thanks to the blogger for the translation. O(∩_∩)O Haha~
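To make a few of these parameters concrete, here is a minimal sketch (the file name, column names, and values are hypothetical):

import pandas as pd

# skip the first row, read 100 rows, treat '-' as NaN,
# and parse the (hypothetical) 'date' column as datetimes
df = pd.read_csv("records.csv",
                 skiprows=1,
                 nrows=100,
                 na_values=['-'],
                 parse_dates=['date'])

# read a large file in chunks of 10,000 rows
for chunk in pd.read_csv("records.csv", chunksize=10000):
    print(chunk.shape)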
Case 1
Using index_col
First prepare a txt file. The biggest problem with this file is that there is a ',' at the end of each line. As the documentation hints, a separator at the end of each line can cause issues, but in actual testing the problem only shows up once the names parameter is used:
goof,1,2,3,ddd,
u,1,3,4,asd,
As,df,12,33,
Write the following code:

df = pd.read_csv("./demo.txt", header=None, names=['a', 'b', 'c', 'd', 'e'])
print(df)

df = pd.read_csv("./demo.txt", header=None, index_col=False, names=['a', 'b', 'c', 'd', 'e'])
print(df)
In fact, this edge case is not hugely significant, and the documentation does not dwell on its specific role. Let's talk about the common use of index_col instead.
When reading a file, if index_col is not set, an integer index starting from 0 is used by default. When you operate on rows or columns and then save the table to a file, you will find an extra column starting at 0. Setting index_col to designate a column as the index avoids this problem.
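A short illustration of that round trip (a minimal sketch; the output file name is hypothetical):

import pandas as pd

df = pd.read_csv("./demo.txt", header=None, names=['a', 'b', 'c', 'd', 'e'])
df.to_csv("out.csv")    # writes an extra unnamed 0, 1, 2, ... index column

df = pd.read_csv("./demo.txt", header=None, index_col=0,
                 names=['a', 'b', 'c', 'd', 'e'])
df.to_csv("out.csv")    # column 'a' is the index; no extra column is added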
Case 2
converters sets processing functions for specified columns; columns can be designated by index or by column name.
import pandas as pd

def fun(x):
    return str(x) + "-haha"

df = pd.read_csv("./test.txt", sep=' ', header=None, index_col=0, converters={3: fun})
print(type(df))
print(df.shape)
print(df)
Common problems with read_csv
- When some IDEs import data files with pandas' read_csv, an error is reported if the file path or file name contains Chinese characters.
Workaround
import pandas as pd

# df = pd.read_csv('F:/test folder/test data.txt')   # may fail if the path contains Chinese
f = open('F:/test folder/test data.txt')
df = pd.read_csv(f)
- Excluding rows: use the skiprows parameter. Note that skiprows=3 skips the first 3 rows, while skiprows=[3] skips only row 3, as shown in the sketch below.
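For instance (a minimal sketch; data.txt is hypothetical):

import pandas as pd

pd.read_csv("data.txt", skiprows=3)     # skip the first 3 rows
pd.read_csv("data.txt", skiprows=[3])   # skip only the row at index 3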
- For irregular separators, use a regular expression to read the file. If the separator is always a single space, sep=" " is enough. But when the separator varies, perhaps one space here and several spaces there, reading with sep=" " can produce very strange data, because the extra spaces are treated as data too. Use a regex instead:
data = pd.read_csv("data.txt", sep="\s+")
- If a Chinese encoding error occurs in the file being read, set the encoding parameter, for example:
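A one-line sketch (the file name is hypothetical):

import pandas as pd

df = pd.read_csv("data.txt", encoding="utf-8")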
- Adding indexes to rows and columns: add a column index with the names parameter, and a row index with index_col.
read_csv has a fair number of parameters, but most are unnecessary, because most of the files you download have a standard format.
read_table function
Basic usage is the same as read_csv; the difference is the default separator.
read_csv reads comma-separated values, so it only splits data correctly on ','; read_table defaults to '\t' (tab) to split the data.
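In other words, the two calls below are equivalent (a small sketch, assuming a tab-separated file named data.tsv):

import pandas as pd

df1 = pd.read_table("data.tsv")
df2 = pd.read_csv("data.tsv", sep="\t")   # same result as read_table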
read_fwf function
Reads files with fixed-width columns, such as this file:
id8141 360.242940 149.910199 11950.7
id1594 444.953632 166.985655 11788.4
id1849 364.136849 183.628767 11806.2
id1230 413.836124 184.375703 11916.8
id1948 502.953953 173.237159 12468.3
read_fwf has two additional parameters to set:
colspecs:
You need to give a list of tuples; each tuple is a half-open interval, [from, to). By default, the intervals are inferred from the first 100 rows of data.
Example:
import pandas as pd

colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]
df = pd.read_fwf('demo.txt', colspecs=colspecs, header=None, index_col=0)
widths:
Use a list of field widths directly instead of the colspecs parameter:

widths = [6, 14, 13, 10]
df = pd.read_fwf('demo.txt', widths=widths, header=None)
read_fwf is not used very often; see http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#files-with-fixed-width-columns to learn more.
read_msgpack function
A serializable data format supported by pandas: a lightweight, portable binary format similar to binary JSON, with high space efficiency and good performance for both writing (serialization) and reading (deserialization).
read_clipboard function
Reads the data on the clipboard; it can be seen as the clipboard version of read_table. Useful when converting tables on web pages.
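Typical usage (a sketch: copy a table from a web page first, then run):

import pandas as pd

df = pd.read_clipboard()   # parses the clipboard contents like read_table
print(df.head())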
The following bug appeared for me at this point:
module 'pandas' has no attribute 'compat'
I updated pandas and it worked fine.
There is another pitfall: when reading the clipboard, if you copied Chinese text, it is easy to fail to read the data.
Solution
- Open site-packages\pandas\io\clipboard.py (you need to locate this file yourself)
- After the line text = clipboard_get(), add: text = text.decode('UTF-8')
- Save, and you're ready to go
read_excel function
As always, the official documentation comes first: http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html#pandas.read_excel
Parameter | Description |
---|---|
io | A file path, file-like object, pandas ExcelFile, or xlrd workbook. The string may be a URL; URL types include http, ftp, s3, and file. For example, a local file can be written as file://localhost/path/to/workbook.xlsx |
sheet_name | Default 0. Use sheet_name=[0, 1] to return multiple sheets, and sheet_name=None to return all sheets. Note: an int/string returns a DataFrame, while None or a list returns a dict of DataFrames; sheet names are given as strings, sheet positions as integers |
header | Row to use as the column names. Default 0, i.e. take the first row; the data are the rows below the column-name row. If the data contain no column-name row, set header=None |
names | Specifies the column names, passed in as a list |
index_col | Column(s) to use as the index, by position or name. If a list is passed, the columns are combined into a MultiIndex |
squeeze | If the parsed data contains only one column, return a Series |
dtype | Data type for data or columns; refer to read_csv |
engine | If io is not a buffer or path, this must be set to identify io. Acceptable values are None or 'xlrd' |
converters | Refer to read_csv |
Other parameters | Basically consistent with read_csv |
If an error is reported when pandas reads an excel file, the usual fix is:
The error is: ImportError: No module named 'xlrd'
pandas requires the separate xlrd module to read excel files: pip install xlrd
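A small usage sketch (the workbook file name is hypothetical):

import pandas as pd

# read the first sheet
df = pd.read_excel("workbook.xlsx")

# read two sheets by position; returns a dict of DataFrames
sheets = pd.read_excel("workbook.xlsx", sheet_name=[0, 1])
print(sheets[0].head())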
read_json function
Parameter | Description |
---|---|
path_or_buf | A valid JSON file path; default None. The string can be a URL, such as file://localhost/path/to/table.json |
orient (case 1) | The expected JSON string format. orient can take the following values: 1. 'split': dict like {index -> [index], columns -> [columns], data -> [values]}; 2. 'records': list like [{column -> value}, ..., {column -> value}]; 3. 'index': dict like {index -> {column -> value}}; 4. 'columns': dict like {column -> {index -> value}}; 5. 'values': just the values array |
typ | The type of object to return ('series' or 'frame'); default 'frame' |
dtype | Data type for data or columns; refer to read_csv |
convert_axes | boolean; try to convert the axes to the proper dtypes. Default True |
convert_dates | A list of columns to parse as dates; if True, try to parse date-like columns. Default True. Reference column labels: ends with '_at', ends with '_time', begins with 'timestamp', is 'modified', or is 'date' |
keep_default_dates | boolean, default True. If parsing dates, also parse the default date-like columns |
numpy | Decode directly into a numpy array. Default False; only numeric data is supported, although labels may be non-numeric. Also note that if numpy=True, the JSON ordering MUST be the same for each row |
precise_float | boolean, default False. Enables the higher-precision (strtod) function when decoding strings to double values. The default (False) uses the fast but less precise built-in functionality |
date_unit | string; the timestamp unit used to detect and convert dates. Default None. By default the timestamp precision is detected; if that is not wanted, force the precision to seconds, milliseconds, microseconds, or nanoseconds with one of 's', 'ms', 'us', or 'ns' respectively |
encoding | The JSON encoding |
lines | Read the file as one JSON object per line |
If the JSON is not parseable, the parser raises one of ValueError, TypeError, or AssertionError.
Case 1
- orient='split'

import pandas as pd

s = '{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,5],[6,9]]}'
df = pd.read_json(s, orient='split')
- orient='records'
The members are a list of dictionaries.

import pandas as pd

s = '[{"a":1,"b":2},{"a":3,"b":4}]'
df = pd.read_json(s, orient='records')
- orient='index'
The index is the key, and a dictionary of the column fields is the value. For example:

s = '{"0":{"a":1,"b":2},"1":{"a":2,"b":4}}'
- orient='columns' or orient='values' can be worked out by analogy.
For some Chinese translations, refer to github > https://github.com/apachecn/pandas-doc-zh
Common problems with read_json()
Reading a json file raises ValueError: Trailing data; this is a JSON format problem.
The original format is
{"a":1,"b":1},{"a":2,"b":2}
Adjust it to
[{"a":1,"b":1},{"a":2,"b":2}]
Or use the lines parameter and adjust the JSON to one record per line:
{"a":1,"b":1}
{"a":2,"b":2}
If there is Chinese text in the JSON file, it is recommended to add the encoding parameter and set it to 'utf-8', otherwise an error is reported.
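A sketch combining both tips (the file name is hypothetical):

import pandas as pd

df = pd.read_json("data.json", lines=True, encoding="utf-8")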
read_html function
Parameter | Description |
---|---|
io | Accepts a URL, file, or string. If an https URL does not work, try removing the 's' and crawling again |
match | Regular expression; returns the tables that match it |
flavor | The parser to use; defaults to 'lxml' |
header | The row where the column headers are located; a list creates a MultiIndex |
index_col | The column(s) corresponding to the row labels; a list creates a MultiIndex |
skiprows | Skip the nth row (sequence) or skip n rows (integer) |
attrs | Attributes, e.g. attrs = {'id': 'table'} |
parse_dates | Parse dates |
To use this, right-click on the web page and inspect it; if you find table markup, i.e. a <table> element, it can be used.
Example: http://data.stcn.com/2019/0304/14899644.shtml
<table class="..." id="...">
<thead>
<tr>
<th>...</th>
</tr>
</thead>
<tbody>
<tr>
<td>...</td>
</tr>
<tr>...</tr>
</tbody>
</table>

<table> : defines a table
<thead> : defines the table header section
<tbody> : defines the table body
<tr> : defines a table row
<th> : defines a header cell
<td> : defines a table cell
Common bug
The following error occurred: ImportError: html5lib not found, please install it
Install html5lib, or use the flavor parameter:
import pandas as pd

df = pd.read_html("http://data.stcn.com/2019/0304/14899644.shtml", flavor='lxml')
For more reference, see the source documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
Ending
This blog post is now complete. I believe you should have a thorough understanding of reading files with pandas. In the process of reading files with pandas, the most common issues are Chinese-encoding problems and format problems; I hope that when you encounter them, they can be solved perfectly.
If you have any questions, I hope I can reply to you in the comments section. I look forward to working with you. Blog Garden - Dream Eraser
Source: https://www.codestudyblog.com/cnb/0319174046.html